Preface


Features of the book The writing of this book was prompted by the huge surge 
of research activities in machine learning and deep learning, and the crucial roles of 
convex optimization in the spotlight fields. This forms the motivation of this book, 
enabling three key features. 

The first feature of this book is the focus of optimization contents tailored for 
modern applications in machine learning and deep learning. Since the optimiza- 
tion field has contributed to many disciplines, ranging from statistics, dynamical 
systems & control, complexity theory to algorithms, there are numerous optimiza- 
tion books that have been published with great interest during the past decades. In 
addition, the optimization discipline has been applied to a widening array of con- 
texts, including science, engineering, economics, finance and management. Hence, 
the books contain a broad spectrum of contents to cover many applications in var- 
ious areas. On the other hand, this book focuses on a single yet prominent field: 
machine learning. Among the vast contents, we put a special emphasis on the con- 
cepts and theories concerning machine learning and deep learning. Moreover, we 
employ many toy examples to ease illustration of some important concepts. Exam- 
ples include historical and canonical instances, as well as the ones that arise in 
machine learning. 

Second, this book is written in a lecture style. A majority of optimization books 
deal with many mathematical concepts and theories together with numerous appli- 
cations in a wide variety of domains. So the concepts and relevant theories are sim- 
ply enumerated to present topics sequentially, following a dictionary style organi- 
zation. While the dictionary style eases search for targeted materials, it often lacks 
coherent stories that may help motivate readers. This book aims at motivating
readers who are interested in machine learning inspired by optimization fundamentals.
So we intend to build an engaging storyline that demonstrates the
role of fundamentals in the field. In order to establish a motivating storyline, this
book adopts a lecture style organization. Each section serves as a note for a lecture 
spanning around 80 minutes, and an intimate connection is made across sections, 
centered around coherent themes and concepts. To make a smooth transition from 
one section to another, we feature two special paragraphs in each section: (i) the 
“recap” paragraph that summarizes what has been done so far, thereby motivating
the contents in the current section; and (ii) the “look ahead” paragraph that
introduces upcoming contents by connecting with the past materials. 

The last feature of this book is the inclusion of many programming exer- 
cises via three prominent software languages: (i) Python; (ii) CVXPY; and (iii) 
TensorFlow. Since one of the key tools in convex optimization is an algorithm 
that requires computation on a computer, it is crucial to know how to implement 
algorithms using software tools. We employ Python as a major programming plat- 
form. To solve traditional convex optimization problems such as linear program, 
least squares, and semi-definite program, we utilize an easy-to-use and high-level 
language, CVXPY, running in Python. To implement machine learning and deep 
learning algorithms, we employ TensorFlow, one of the most popular deep learning 
frameworks. TensorFlow provides numerous powerful built-in functions that ease 
performing many important procedures in deep learning. One of the key benefits 
of TensorFlow is that it is fully integrated with Keras, the most high-level library 
with a focus on enabling fast user experimentation. Keras allows us to go from idea 
to implementation with very few steps. 


Structure of the book This book is made up of course materials that we devel- 
oped for the following two courses at KAIST: (i) EE523 Convex Optimization 
(offered in Spring 2019); and (ii) EE424 Introduction to Optimization (offered 
in Fall 2020 and 2021). It consists of three parts, each comprising many
sections. Each section contains materials covered by a single lecture with a duration
of approximately 80 minutes. A problem set (which served as homework
in the courses) is included every three or four sections. The contents of the three
parts are summarized below.


I. Convex optimization basics (14 sections and 4 problem sets): A brief history of 
convex optimization; basic concepts on convex sets and convex functions, 
and the definition of convex optimization; gradient descent; linear pro- 
gram (LP), LP relaxation, least squares, quadratic program, second-order 
cone program, semi-definite program (SDP) and SDP relaxation; CVXPY 
implementation. 

II. Duality (7 sections and 3 problem sets): The Lagrange function, the dual 
function and the dual problem; strong duality, KKT conditions, and the 
interior point method; weak duality and Lagrange relaxation. 

III. Machine learning applications (14 sections and 4 problem sets): Supervised 
learning and the role of optimization in logistic regression and deep learn- 
ing; backpropagation and its Python implementation; unsupervised learn- 
ing, Generative Adversarial Networks (GANs), Wasserstein GAN, and the 
role of LP and duality theories; fair machine learning and the role of the 
regularization technique and the KKT conditions; TensorFlow implemen- 
tation of a deep learning classifier, GANs, Wasserstein GAN and a fair 
machine learning algorithm. 


At the end, we offer three appendices for brief tutorials of the employed pro- 
gramming languages (Python, CVXPY and TensorFlow). We also provide a list of 
references that are related to the contents discussed in the book. But we do not 
explain details, since we do not aim to exhaust the immense research literature. 


How to use this book This book is written as a textbook for a senior-level 
undergraduate course, yet it is also suitable for a first-year graduate course. The 
expected background is solid undergraduate courses in linear algebra and probabil- 
ity, together with basic familiarity with Python. 

For students and interested readers, we provide some guidelines: 


1. Study one section per day and two sections per week: Since each section is 
designed for a single lecture and two lectures are normal per week in a course 
offering, we recommend this way of reading. 

2. Go through all the contents in Parts I and II: One of the most important concepts
in optimization is duality. So if you are familiar with convex optimization
basics, then it is okay to dive directly into Part II without exploring
Part I. Otherwise, we recommend that you go through all
the contents in Parts I and II in a sequential manner. A motivating storyline is
made across sections, and exercise problems are placed at suitable points in
between. We believe this way of reading maximizes your motivation, interest
and understanding of the materials.

3. Explore Part III in part depending on your interest: Since Part III is dedicated 
to applications, you may want to read them in part. Nonetheless, we made 
a logical storyline assuming that every section is read sequentially. One of 
the key features in Part III is TensorFlow implementation. You may be able 
to implement all the covered algorithms, consulting with a guideline in the 
main body together with skeleton codes offered in problem sets and appen- 
dices. 

4. Solve four to five basic problems in each problem set: Around 90 problems
(more than 200 subproblems) are provided. Most of them elaborate on concepts
discussed in the main text. The exercises range from basics of linear
algebra and probability, through relatively straightforward derivations of results in the
main text and in-depth exploration of non-trivial concepts not fully explained
in the main text, to programming implementation via Python, CVXPY
or TensorFlow. All of the problems are tightly coupled with the storyline
established. So working on at least some of the problems is essential to understanding
the materials.


In the course offerings at KAIST, we have been able to cover most of the materials
in Parts I and II, yet only two to three applications in Part III. Depending on the
background and interest of students, and time availability, one can envision several
other ways to structure a course around this book. Examples include:


1. Semester-based course (24-26 lectures): Cover all the sections in Parts I and
II, and two to three applications in Part III, e.g., (i) supervised learning and
GANs (or Wasserstein GAN), or (ii) supervised learning and fair machine
learning.

2. Quarter-based course (18-20 lectures): Cover almost all the materials in Parts I
and II, except some topics like LP relaxation, simplex algorithm, least squares
for computed tomography, SDP relaxation, and the proof of strong duality
theorem. Investigate two applications picked from Part III.

3. A graduate course for students with convex optimization basics: Briefly review
the contents in Part I spending around four to six lectures. Cover all the
materials in Parts II and III.


Programming exercises may be covered through homework to save time.


DOI: 10.1561/9781638280538.ch1


Chapter 1 


Convex Optimization Basics 


1.1 Overview of the Book 


Outline In this section, we will cover two basic things. The first is the logistics of this
book. We will explain how the book is organized and how it will proceed. The
second is a brief overview of this book. We are going to explore the
story of how optimization was developed, as well as what we will cover throughout
this book.


Prerequisite The key prerequisite for this book is to have a good background 
in linear algebra. In terms of a course, this means that you should have taken an 
introductory-level course on linear algebra. The course is usually offered in the 
Department of Mathematics, e.g., MAS109 at KAIST. Some of you may have taken
a different yet equivalent course in another department. This is also okay. Taking
a somewhat advanced-level course (e.g., MAS212 Linear Algebra at KAIST) is
optional although it is recommended. If you feel uncomfortable even though you took
the relevant course(s), you may want to consult the exercise problems that
we will put in proper places as the book proceeds.

Another prerequisite for the book is familiarity with the concept of probability.
In terms of a course, this means that you are expected to be comfortable
with the contents dealt with in an undergraduate-level probability course, e.g., EE210 at
KAIST. Taking an advanced course like Random Processes (e.g., EE528 at KAIST)
is optional. This is not a strong prerequisite. In fact, the optimization concept itself
has nothing to do with probability. But some problems (especially the ones that
arise in machine learning and deep learning, which we will also touch upon in this
book) deal with quantities that are random and therefore are described with
probability distributions. This is the only reason that an understanding of probability
is needed. Hence, reviewing any introductory probability book of your choice may
suffice.

There must be a reason as to why linear algebra is crucial for this book. The 
reason is related to the definition of optimization. A somewhat casual definition 
of optimization is to make the best choice among many candidates or alternatives. 
A more math-oriented definition of optimization which we will rely upon is to 
choose an optimization variable (or a decision variable) so as to minimize or maxi- 
mize a certain quantity of interest possibly given some constraint(s). Here the opti- 
mization variable and the certain quantity are the ones that relate the optimization 
to linear algebra. In many situations, the optimization variables are multiple real 
values which can be succinctly represented as a vector (just a collection of num- 
bers). Also the certain quantity (which is a function of the optimization variables) 
can be represented as a function that involves matrix-vector multiplication and/or 
matrix-matrix multiplication, which are basic operations in linear algebra. In fact, 
many operations and techniques in linear algebra help formulate an optimization 
problem in a very succinct manner, and therefore serve to theorize the optimization
field. This is the very reason that this book requires a good understanding of, and
manipulation skills in, linear algebra. Some of you may not be well trained
in expressing a quantity of interest in vector/matrix form. Please don't be
discouraged. You will have lots of chances to be trained via examples that will be
covered throughout the book, particularly in many problem sets. Whenever some
advanced techniques are needed, we will provide detailed explanations and/or relevant
exercise problems which help you to understand the required techniques.


Problem sets A problem set is offered every three or four sections, so there
are 11 problem sets in total. We encourage you to cooperate with your colleagues
in solving the problem sets. The problem sets are vehicles for learning,
and whatever maximizes learning for you is desirable. This usually includes discussion,
teaching others and learning from others. Solutions will be available
only to instructors upon request. Some problems may require programming tools
like Python, CVXPY and TensorFlow. We will use Jupyter notebook. Please refer
to the installation guide provided in Appendix A.1. Or you can consult:


https://jupyter.readthedocs.io/en/latest/install.html




We also provide tutorials for the programming tools in appendices: (i) Appendix A 
for Python; (ii) Appendix B for CVXPY; and (iii) Appendix C for TensorFlow. 


Optimization Let us investigate how, and in what contexts, the theory of optimization
was developed. Based on this, we will present the detailed topics that we will learn
about throughout the book.

Let us start with a story of how the theory of optimization was developed. What 
is optimization? As mentioned earlier, a casual informal definition of optimization 
is to make the best choice out of possible candidates. It comes up often in our daily 
life. For example, we may want to figure out a scheduling strategy for airplanes 
so that the total waiting time is minimized under some constraints, e.g., an
airplane with an emergency should take off no later than a specific time. Or family 
members may want to choose a restaurant to visit for dinner, so as to maximize the 
happiness of the members (if it can be quantified) given a distance constraint, e.g., 
a chosen restaurant should be within a few kilometers. 

A mathematical definition of optimization that we are interested in here is to 
choose an optimization variable that minimizes (or maximizes) a certain quantity 
of interest possibly given some constraints. We aim at learning a theory concerning
such a formal definition. Specifically, we are interested in learning the mathematical
theory of optimization, which has been extensively developed and explored over the past few
centuries. In fact, the birth of the theory can be traced back to an astronomy problem
in the 1800s (Serio et al., 2002). So let us first talk about the problem so that you
can readily see how the theory was developed.


An astronomy problem in the early 1800s (Serio et al., 2002) In the
early 1800s, astronomers discovered a new planetoid (also called a dwarf planet), which
was later named Ceres. See Fig. 1.1. Giuseppe Piazzi was the first astronomer to
discover the planetoid. At that time, he wished to figure out the orbit of Ceres.


Figure 1.1. A new planetoid, named Ceres, discovered in the early 1800s. 

Carl Friedrich Gauss (1777 ~ 1855) 


Figure 1.2. Carl Friedrich Gauss, a German mathematician, is the father of the optimiza- 
tion field. In fact, he is well-known as an inventor of the Gaussian distribution (one of the 
very famous and useful distributions in probability and statistics) as well as the Gaussian 
elimination (an efficient method which allows us to do matrix inversion or to solve linear 
equations). 


To this end, he made 19 observations (of its locations) over 42 days. However,
the tracking was then interrupted: the object was lost in the glare
of the Sun. So many astronomers wanted to predict the hidden trajectory from
only these 19 partial observations.

One interesting attempt was made by a young German mathematician named Carl
Friedrich Gauss (1777 ~ 1855) (Gauss, 1887). See Fig. 1.2 for his portrait. He
took a specific yet smart approach and figured out the trajectory successfully. In the
process, he developed a mathematical problem that later laid the foundation of
the optimization field.


Gauss’s approach (Gauss, 1887) Here is what Gauss did. See Fig. 1.3 for illustration.
First of all, he gathered all the observations (each pointing to a location of
Ceres measured at a certain time) scattered around the Sun. Let $b_i \in \mathbb{R}^3$ indicate the
coordinates of the location of the $i$th observation where $i \in \{1, 2, \dots, m\}$. Here $m$
denotes the total number of observations, which could be up to 19 in the astronomy
problem context. Remember that Piazzi made 19 observations.

Next he fixed an arbitrary number of observation points, say two points, marked
with hollow star symbols in the figure. You may wonder why two. This will be
explained in detail soon. He then drew an orbit that crosses the two fixed points.

Figure 1.3. Gauss’s approach for searching the orbit of Ceres. 


Actually it was well established in astronomy that an orbit can be fully represented
with six parameters. So the orbits that cross the two fixed points can be
represented with four parameters. Depending on the choice of the free parameters,
there are many ways to draw such orbits. Let’s denote by $x \in \mathbb{R}^4$ a vector that
stacks up the four parameters. Here we see that the dimension of $x$ is six minus the
number of observation points that we fixed earlier. As the number of fixed
points increases, a drawn orbit fits the fixed points more tightly, and this may incur the
risk of fitting the other observations poorly. On the other hand, without fixing
any points, we need to figure out six parameters, and this may be challenging with
only 19 observations. As a number that balances these two issues well, Gauss chose
the number two.

In terms of the vector $x \in \mathbb{R}^4$, Gauss could represent a point on the orbit which
is the closest to each $b_i$. He could approximate the point as a vector that comprises
linear combinations of the components of $x$, i.e., $A_i x$. Here $A_i \in \mathbb{R}^{3 \times 4}$ indicates
a certain matrix that maps $x$ to the point on the orbit nearest to $b_i$. He then believed
that if the orbit were the ground truth that we wish to figure out, then the distance
to $b_i$, reflected in $\|A_i x - b_i\|$, must be only within a location-measurement error.
Here $\|x\|$ denotes the Euclidean norm (also called the $\ell_2$ norm), defined as
$\|x\| := \sqrt{x_1^2 + \cdots + x_d^2}$ where $d$ is the dimension of $x$. This motivated him to formulate
the following optimization problem:

$$\min_{x \in \mathbb{R}^4} \sum_{i=1}^{m} \|A_i x - b_i\|^2. \tag{1.1}$$

Since we fixed the two points that an orbit must pass, $\|A_i x - b_i\| = 0$ for the fixed
points.



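Gauss then observed that $\sum_{i=1}^{m} \|A_i x - b_i\|^2$ can be simplified as:

$$\sum_{i=1}^{m} \|A_i x - b_i\|^2
= \left\| \begin{bmatrix} A_1 x - b_1 \\ \vdots \\ A_m x - b_m \end{bmatrix} \right\|^2
= \left\| \begin{bmatrix} A_1 \\ \vdots \\ A_m \end{bmatrix} x - \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} \right\|^2. \tag{1.2}$$

Let

$$A := \begin{bmatrix} A_1 \\ \vdots \\ A_m \end{bmatrix} \in \mathbb{R}^{3m \times 4}, \qquad
b := \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} \in \mathbb{R}^{3m}. \tag{1.3}$$

Then, the optimization problem can be re-written as:

$$\min_{x \in \mathbb{R}^4} \|Ax - b\|^2. \tag{1.4}$$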

Least squares The problem (1.4) is the famous problem that is now known 
as least squares. Notice that the word “least” comes from “min” and “squares” is 
due to the square exponent placed above the Euclidean norm. You may wonder 
why Gauss employed the square as an exponent in the objective function. Why 
not other exponents like 1, 3 or 4? This was due to the mathematical beauty that 
Gauss was obsessed with. Notice that using other exponents like 1 or 3 or 4, one 
cannot do the beautiful simplification like (1.2). If you cannot see why, then check 
in Prob 11. 
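
The simplification in (1.2), and the reason it is special to the square exponent, can be checked numerically. The following is a minimal sketch in plain Python, assuming small made-up blocks `A_blocks` and `b_blocks` (hypothetical 2-by-2 toy data, not Gauss's actual 3-by-4 matrices). It suggests that the sum of squared norms equals the squared norm of the stacked vector, whereas the analogous identity fails for exponent 1.

```python
import math

def matvec(A, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(vi * vi for vi in v))

# Hypothetical toy data: m = 2 blocks, each A_i is 2x2 and each b_i is in R^2.
A_blocks = [[[1.0, 2.0], [0.0, 1.0]],
            [[3.0, -1.0], [2.0, 2.0]]]
b_blocks = [[1.0, 0.0], [2.0, -1.0]]
x = [0.5, -1.5]

# Residuals A_i x - b_i, one per block.
residuals = [[r - bi for r, bi in zip(matvec(A, x), b)]
             for A, b in zip(A_blocks, b_blocks)]

# Stacking the residuals gives the vector Ax - b of (1.3)-(1.4).
stacked = [r for res in residuals for r in res]

sum_of_squares = sum(norm(res) ** 2 for res in residuals)  # left side of (1.2)
stacked_square = norm(stacked) ** 2                        # right side of (1.2)

sum_of_norms = sum(norm(res) for res in residuals)         # exponent 1 instead of 2
stacked_norm = norm(stacked)

print(math.isclose(sum_of_squares, stacked_square))  # identity (1.2) holds
print(math.isclose(sum_of_norms, stacked_norm))      # no such identity for exponent 1
```

The square is what makes the per-block terms add up coordinate by coordinate; with exponent 1 the square roots get in the way.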

The least-squares problem opened up the optimization field (since then,
people have tried to theorize the field with passion) and has also played a significant
role in the field. There are two reasons why the problem played such a big role.
The first is that (1.4) has a beautiful closed-form solution:

$$x^* = (A^T A)^{-1} A^T b \tag{1.5}$$

where $(\cdot)^T$ indicates the transpose of a matrix and $(\cdot)^{-1}$ denotes matrix inversion.
We will later show why the solution is of the form (1.5). Please be patient until
we get to that point. The second reason is that there are efficient algorithms and
software that enable us to compute the solution involving the matrix inverse. Even in
the 1800s, there was an efficient matrix-inversion algorithm, based on Gaussian
elimination, again due to Gauss.
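
To see the closed-form solution (1.5) at work, here is a small sketch in plain Python, assuming a toy line-fitting problem with two unknowns (the data points and helper names are made up for illustration). It forms $A^T A$ and $A^T b$, inverts the 2-by-2 matrix explicitly, and checks that the gradient $2A^T(Ax^* - b)$ of the objective vanishes at the computed point. For larger problems one would normally rely on a linear-algebra routine (e.g., NumPy's `numpy.linalg.lstsq`) rather than invert by hand.

```python
# Toy data (made up): fit y ≈ x0 * t + x1 to four noisy points (t, y).
ts = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

A = [[t, 1.0] for t in ts]   # m x 2 matrix, one row per observation
b = ys

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(M, N):
    return [[sum(m * n for m, n in zip(row, col)) for col in zip(*N)] for row in M]

def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def inverse_2x2(M):
    """Explicit inverse of a 2x2 matrix; raises if the matrix is singular."""
    (p, q), (r, s) = M
    det = p * s - q * r
    if det == 0:
        raise ValueError("A^T A is singular, so formula (1.5) does not apply")
    return [[s / det, -q / det], [-r / det, p / det]]

At = transpose(A)
AtA = matmul(At, A)                      # A^T A, a 2x2 matrix
Atb = matvec(At, b)                      # A^T b
x_star = matvec(inverse_2x2(AtA), Atb)   # (A^T A)^{-1} A^T b, as in (1.5)

# Gradient of ||Ax - b||^2 at x_star, i.e., 2 A^T (A x_star - b).
residual = [r - bi for r, bi in zip(matvec(A, x_star), b)]
gradient = [2.0 * g for g in matvec(At, residual)]

print("x* =", x_star)
print("gradient at x* =", gradient)
```

The fitted slope and intercept come out close to 2 and 1, the values used to generate the (perturbed) data, and the gradient is zero up to rounding error.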

Leonid Kantorovich (1912 ~ 1986) 


Figure 1.4. Leonid Kantorovich is a Soviet economist. He is known as the father of Linear 
Program (LP), one of the prominent optimization classes that we will study soon in depth. 


Since the development of least squares, people have tried to translate any problem
of their interest into a least-squares problem. So a variety of translation techniques
(that we will also study in this book) have been developed. However, people encountered
many situations in which such translation does not work. This challenge was
expected. As you can easily imagine, the least-squares problem is just one small
class among all the optimization problems that can be formulated in the world.


A breakthrough by Kantorovich Unfortunately there was no significant
progress on the optimization theory for more than a century. But history
was made again in 1939 by a Soviet economist named Leonid Kantorovich (Kantorovich,
1960). See Fig. 1.4 for his portrait.

He made a breakthrough in the course of solving a military-related problem 
during World War II (sort of forced to do so by the Soviet Union government). 
The problem that he was trying to solve was to plan expenditures and returns of 
soldiers to minimize the entire cost of the Soviet Union Army as well as to maximize 
the losses imposed on the enemy. 

In the process, he formulated an optimization problem now known as the
very famous linear program (LP).¹ Simply put, the LP is an optimization problem
in which the objective function and the functions that appear in constraints are
all linear. We will delve into its formal definition later on. Unlike the least-squares
problem, the LP has no closed-form solution. But the good news is that Kantorovich
could develop a very efficient algorithm that achieves the optimal solution. Actually
having a closed-form solution is not that important as long as we know how to get
to the optimal solution. This achievement won him the Nobel Prize in Economics
in 1975 (Kantorovich, 1989).

1. The “program” (or “programming”) is jargon frequently used in the field, which refers to an optimization
problem. So the formal name of LP is a linear optimization problem.

The development of LP made people very excited again, trying to
translate a problem of interest into a least-squares problem or LP. While many
LP-translation techniques (that we will also study) have been developed, people
still encountered many situations in which the translation is not doable.


A class of tractable optimization problems Inspired by the development
of LP, people tried to come up with a class of tractable optimization problems
which can be solved reliably and efficiently. LP is one such instance. Less than two decades later,
another tractable problem, called quadratic program (QP), was developed (Frank
and Wolfe, 1956). It is a generalized version, as it includes the least-squares
problem and LP as special cases. See Fig. 1.5.

In the 1990s, another problem, called second-order cone program (SOCP),
was developed which subsumes as special cases all of the prior problems (Nesterov
and Nemirovskii, 1994). Around the same time, a larger class of problems,
called semi-definite program (SDP), was developed (Alizadeh, 1991; Nesterov and
Nemirovskii, 1994). More and more tractable problems have been developed since.
It turns out that all of the tractable problems share a common property (concerning
the word “convex”), and this property established the class of tractable problems,
named convex optimization, which we will mostly focus on throughout this
book.


[Figure: nested classes of problems within convex optimization. Least squares (1800s) and linear program (LP, 1939) sit inside quadratic program (QP, 1956), which sits inside second-order cone program (SOCP, 1994), which sits inside semi-definite program (SDP, 1994).]

Figure 1.5. A class of tractable optimization problems: Convex optimization.

Book outline This book consists of three parts. In Part I, we will study the basic
concepts and several mathematical definitions required to understand what convex
optimization is as well as how to translate a problem of interest into a convex problem.
We will then explore five instances of convex optimization problems: LP, least
squares, QP, SOCP and SDP. We will focus on techniques which help in recognizing
(and translating to) such problems. We will also study some prominent algorithms
for solving the problems. In Part II, we will study one of the key theories in the
optimization field, called duality. There are two types of duality: (1) strong duality;
and (2) weak duality. Strong duality is quite useful for gaining algorithmic
insights for convex problems. Weak duality helps in dealing with difficult non-convex
problems, by providing an approximate solution. In Part III, we
will explore applications that arise in machine learning: (i) supervised learning, one
of the most popular machine learning methodologies; (ii) Generative Adversarial
Networks (GANs), one of the breakthrough models for unsupervised learning; and
(iii) fair classifiers, which is one of the trending topics in machine learning.




1.2 Definition of Convex Optimization 


Recap In the last section, we saw how the optimization theory was developed.
There were two breakthroughs in the history of optimization. The first was
made by the famous Gauss. In the process of solving an astronomy problem of figuring
out the orbit of Ceres (which many astronomers were trying to address in
the 1800s), he developed an optimization problem, which is now known as
the least-squares problem. The beauty of the least-squares problem is twofold:
(i) it has a closed-form solution; and (ii) there is an algorithm which enables us to
efficiently do the matrix inversion required for computing the solution. The beauty of
the problem opened up the optimization field and has played a significant role in
the field.

The second breakthrough was made by Leonid Kantorovich. In the process of 
solving a military-related problem, he could formulate a problem which is now 
known as linear program (LP). The good thing about LP is that there is an efficient 
algorithm which allows us to compute the optimal solution reliably and efficiently 
although the closed form solution is unknown. In other words, Kantorovich came 
up with the concept of tractable optimization problems which can be solved via 
an algorithm without the knowledge of the concrete form of the optimal solution. 
This motivated many followers to mimic his approach, thereby coming up with 
a class of tractable optimization problems: convex optimization. See the class of 
convex optimization problems in Fig. 1.6. 


Outline The goal of this section is to understand what convex optimization is.
To this end, we will cover four things. First we will study a standard mathematical
formulation of optimization problems. It turns out the definition of convex optimization
problems requires the knowledge of convex functions. But the definition of
convex functions relies upon the concept of convex sets. So in the second part, we
will study what a convex set is. We will also investigate some important examples
which help us become familiar with the concept and which play an important
role in defining convex optimization. Next we will study the definition of convex
functions, building upon a couple of examples and crucial properties to be explored
in depth. Using all of these, we will finally investigate a standard mathematical formulation
of convex optimization problems.

[Figure: nested classes of problems within convex optimization. Least squares (1800s) and linear program (LP, 1939) sit inside quadratic program (QP, 1956), which sits inside second-order cone program (SOCP, 1994), which sits inside semi-definite program (SDP, 1994).]

Figure 1.6. A class of tractable optimization problems: Convex optimization.


Optimization problem in standard form Let us start by recalling the definition
of optimization: Choosing an optimization variable that minimizes (or maximizes)
a certain quantity of interest possibly given constraints. We denote the optimization
variable by $x := [x_1, \dots, x_d]^T \in \mathbb{R}^d$ where $d$ indicates the dimension
of the variable. Denote the objective function (the certain quantity) by $f(x) \in \mathbb{R}$.
Notice that the value of the objective function should always be a real number, not a vector.
There are two types of constraints: (i) inequality constraints; and (ii) equality constraints.
The inequality constraints are represented in a form like $f_i(x) \le c_i$ where
$i \in \{1, \dots, m\}$. Here $m$ indicates the number of the inequality constraints. Without loss of
generality (WLOG), the constant $c_i$ can be merged with $f_i(x)$ and hence the form
can be simplified as: $f_i(x) \le 0$. Here WLOG means that the general case can
readily be covered with some proper modification to the simplified case of focus.
That’s why we say that there is no loss of generality although we focus on the
simplified case. Similarly the equality constraints can be represented by: $h_i(x) = 0$
where $i \in \{1, \dots, p\}$ and $p$ denotes the number of the equality constraints.

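Using these notations, one can write the standard form of optimization problems
as:

$$\begin{aligned}
\min_{x \in \mathbb{R}^d} \quad & f(x) \\
\text{subject to} \quad & f_i(x) \le 0, \; i \in \{1, 2, \dots, m\}, \\
& h_i(x) = 0, \; i \in \{1, 2, \dots, p\}.
\end{aligned} \tag{1.6}$$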
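
As a rough illustration of the standard form (1.6), the sketch below stores a problem as an objective plus lists of constraint functions, and shows how a constraint of the form $f_i(x) \le c_i$ is folded into $f_i(x) - c_i \le 0$. The class name `StandardFormProblem`, its methods and the toy problem are assumptions made for this example only; they are not notation or code used elsewhere in the book.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

Vector = Sequence[float]
Func = Callable[[Vector], float]

@dataclass
class StandardFormProblem:
    """Hypothetical container for a problem written in the form (1.6)."""
    objective: Func
    inequalities: List[Func] = field(default_factory=list)  # each means f_i(x) <= 0
    equalities: List[Func] = field(default_factory=list)    # each means h_i(x) == 0

    def add_inequality(self, f: Func, c: float = 0.0) -> None:
        """Add f(x) <= c by merging the constant: f(x) - c <= 0."""
        self.inequalities.append(lambda x, f=f, c=c: f(x) - c)

    def is_feasible(self, x: Vector, tol: float = 1e-9) -> bool:
        """Check all constraints at x, up to a small numerical tolerance."""
        return (all(f(x) <= tol for f in self.inequalities)
                and all(abs(h(x)) <= tol for h in self.equalities))

# Toy problem: minimize x1^2 + x2^2 subject to x1 + x2 >= 1 and x1 - x2 = 0.
problem = StandardFormProblem(objective=lambda x: x[0] ** 2 + x[1] ** 2)
# x1 + x2 >= 1 is first rewritten as -(x1 + x2) <= -1, then the constant is merged.
problem.add_inequality(lambda x: -(x[0] + x[1]), c=-1.0)
problem.equalities.append(lambda x: x[0] - x[1])

print(problem.is_feasible([0.5, 0.5]))  # on the boundary of the inequality: feasible
print(problem.is_feasible([0.2, 0.2]))  # violates x1 + x2 >= 1
print(problem.is_feasible([1.0, 0.0]))  # violates x1 - x2 = 0
```

Note that a "greater than or equal" constraint also fits the standard form after multiplying both sides by $-1$.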


WLOG, it suffices to consider the minimization problem, since a maximization
problem can readily be converted by flipping the sign of the objective: minimizing $f(x)$ is equivalent to
maximizing $-f(x)$ (the two share the same solution, and $\min f(x) = -\max(-f(x))$). Here we have two conventions that allow us to simplify the above
form (1.6). First we use the colon “:” to indicate “subject to”. Second, $x \in \mathbb{R}^d$
placed below min is often omitted since the role of $x$ is clear enough from the
context. Hence, the simpler form reads:

$$\min f(x): \quad f_i(x) \le 0, \; i \in \{1, 2, \dots, m\}, \qquad h_i(x) = 0, \; i \in \{1, 2, \dots, p\}. \tag{1.7}$$




Two more things to note. One is the optimal value, denoted by $p^* := \min f(x)$.
The other is the optimal solution, denoted by $x^* := \arg\min f(x)$, also called the
minimizer. Here “arg min” stands for “the argument that minimizes”, i.e., the value of $x$ at which the minimum is attained.
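
The difference between the optimal value $p^*$ and the minimizer $x^*$ can be illustrated with a brute-force search over a grid. This is only a sketch for a one-dimensional toy problem of our own choosing (minimize $(x-1)^2$ subject to $x - 0.5 \le 0$); a grid search is not how such problems are solved in practice. It also checks that minimizing $f$ and maximizing $-f$ pick the same point.

```python
def f(x):
    """Objective of the toy problem."""
    return (x - 1.0) ** 2

def f1(x):
    """Inequality constraint in standard form: f1(x) <= 0 means x <= 0.5."""
    return x - 0.5

# Candidate points on a grid over [-2, 2] with step 0.001 (an arbitrary choice).
grid = [i / 1000.0 for i in range(-2000, 2001)]
feasible = [x for x in grid if f1(x) <= 0.0]

p_star = min(f(x) for x in feasible)   # optimal value: a value of f
x_star = min(feasible, key=f)          # optimal solution: a point x

# Maximizing -f over the same feasible points gives the same solution,
# and the optimal values differ only by a sign.
x_star_via_max = max(feasible, key=lambda x: -f(x))
p_star_via_max = -max(-f(x) for x in feasible)

print("p* =", p_star, " x* =", x_star)
print(x_star_via_max == x_star, p_star_via_max == p_star)
```

Here the unconstrained minimizer $x = 1$ is infeasible, so the search settles on the boundary point of the constraint.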


Convex set Now what is convex optimization that we wish to figure out in this 
section? As mentioned in the beginning, to define this, we need to know about the 
concept of convex functions. But the definition of convex functions requires the 
knowledge of convex sets. So we will first study the definition of the convex set. 

A set $S$ is said to be convex if and only if


$$x, y \in S \implies \lambda x + (1 - \lambda) y \in S, \quad \forall \lambda \in [0, 1] \tag{1.8}$$

where the sign “$\forall$” means “for all”.


Examples: Point, line, plane, line segment, ... To get a concrete feel for
what a convex set means, let us explore several examples. The simplest example
is the set containing a single point. This is obviously a convex set, as any convex
combination of $x$ and $y$, represented as $\lambda x + (1 - \lambda) y$, is just the
single point.

The second simplest example is perhaps the set that contains a line that lives in a
2-dimensional ambient space. This is also convex because any convex combination 
of two points lying on a line should also lie on the line. Here let us investigate 
how to represent the convex set. This representation will help us to understand the 
concept of convex optimization later on. Notice that the line in a 2-dimensional 
space can be represented as: y = ax + b where a and b indicate the slope and 
y-intercept, respectively. Hence, one can represent the set as: 


$$S = \{x := [x_1, x_2]^T : x_2 = a_1 x_1 + b_1\}. \tag{1.9}$$


Here we use the notations $(x_1, x_2)$ instead of $(x, y)$; also $(a_1, b_1)$ instead of $(a, b)$.
Using vector notations, one can define $a := [-a_1, 1]^T$ and $b := b_1$, which in turn
simplifies the representation (1.9) as:


$$S = \{x : a^T x - b = 0\}. \tag{1.10}$$
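
As a quick numerical sanity check of this representation, the sketch below verifies that convex combinations of two points on a line still satisfy $a^T x - b = 0$. The slope 2, the intercept 1 and the two sample points are hypothetical values chosen only for illustration.

```python
def convex_combination(x, y, lam):
    """Return lam * x + (1 - lam) * y for points given as tuples."""
    return tuple(lam * xi + (1.0 - lam) * yi for xi, yi in zip(x, y))


def on_line(x, a1, b1, tol=1e-9):
    """Check a^T x - b = 0 with a = [-a1, 1]^T and b = b1, as in (1.10)."""
    a = (-a1, 1.0)
    return abs(a[0] * x[0] + a[1] * x[1] - b1) <= tol


a1, b1 = 2.0, 1.0   # hypothetical slope and intercept
x = (0.0, 1.0)      # satisfies x2 = 2 * x1 + 1
y = (3.0, 7.0)      # satisfies x2 = 2 * x1 + 1

# Try lambda = 0, 0.1, ..., 1 and check that each combination stays on the line.
lams = [k / 10 for k in range(11)]
all_on_line = all(on_line(convex_combination(x, y, lam), a1, b1) for lam in lams)
print(all_on_line)
```

A finite set of weights cannot prove convexity, of course; the check merely illustrates the definition (1.8) on this set.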


The third example is the natural extension of the second example: a plane living
in a 3-dimensional space. This is also obviously a convex set, as any convex combination of
two points lying on a plane also lies on the plane. The representation of the convex
set is exactly the same as (1.10), except that now the dimensions of $x$ and $a$ are 3.
Why?

The fourth example is the one in which the dimension of the object of interest
differs from the dimension of its ambient space by 2. One such example is the set
that contains a line living in a 3-dimensional space. This is also a convex set, since
the object of interest is a line. But the representation of such a convex set is
different from that of (1.10). A line is actually the intersection of two planes in a
3-dimensional space. So the representation of the set should read:


$$S = \{x : a_1^T x - b_1 = 0, \; a_2^T x - b_2 = 0\}. \tag{1.11}$$

Defining $A := [a_1, a_2]^T$ and $b := [b_1, b_2]^T$, one can simplify this as:

$$S = \{x : Ax - b = \mathbf{0}\} \tag{1.12}$$


where the thick $\mathbf{0}$ indicates the all-zero vector $[0, 0]^T$. For illustrative simplicity, we
will use the normal 0 even to denote the all-zero vector, as it may be clear from the
context.

Looking carefully at these examples, one can see that the representation of a
line, a plane or a hyperplane (a subspace whose dimension is one less than that
of its ambient space) lying in a larger-dimensional ambient space takes the form
(1.12). Depending on the dimension of the matrix $A \in \mathbb{R}^{p \times d}$, $S$ may refer
to the set containing a line, a plane or a higher-dimensional plane. For instance,
when $d - p = 1$, $S$ refers to a line (assuming the rows of $A$ are linearly independent).
When $d - p = 2$, $S$ indicates a plane. The set
represented by the form (1.12) is called an affine set. An affine function is a linear
function that allows for a constant bias term; the formal definition will be
given later on. Since the set in (1.12) is defined through the affine function $f(x) = Ax - b$,
it is called an affine set.

Another example that we would like to mention is the line segment; see the top left
of Fig. 1.7. Again this is obviously a convex set. On the other hand, a broken line,
a line that is broken in the middle, is not convex, since some convex combination
of two points on the broken line may fall in the gap;
see the top right of Fig. 1.7.


More examples: Convex polygon, polyhedron, polytope, ... You may wonder
if there are any other examples beyond point/line/plane. Of course, there
are many. One object that you may be interested in is a polygon living in a
2-dimensional space. In particular, the closed polygon, in which the points inside (and
on the boundary of) the polygon are included in the set, is a convex set. See the bottom left
of Fig. 1.7 for illustration. On the other hand, the boundary-only polygon is not a
convex set. See the bottom right of Fig. 1.7.


[Figure 1.7 panels: line segment (top left), broken line (top right), closed polygon
(bottom left), boundary-only polygon (bottom right).]

Figure 1.7. Examples of convex sets (left two) and non-convex sets (right two).

As you may imagine, the representation of the closed-polygon convex set is different
from the form (1.12) of affine sets. The closed polygon can actually be represented
as the intersection of half-planes. A half-plane is a planar region consisting
of all points on one side of an infinite straight line, and no points on the other
side. It is represented by $a_i^T x - b_i \le 0$. Hence, the representation of such a set
reads:


$$S = \{x : Ax - b \le 0\} \tag{1.13}$$


where the inequality indicates a component-wise inequality.

Similarly, a polyhedron living in a 3-dimensional ambient space (or a polytope
living in a $d$-dimensional space and hence difficult to visualize) is a convex set and
can also be represented in the form (1.13).
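
The half-plane representation can be tried out directly. The sketch below is a minimal illustration in which the unit square plays the role of the closed polygon; the matrix `A`, the vector `b` and the sample points are hypothetical. It also shows why the boundary-only polygon fails to be convex: the midpoint of two boundary points lands in the interior.

```python
def in_polygon(x, A, b, tol=1e-12):
    """Return True if every half-plane constraint a_i^T x - b_i <= 0 holds."""
    return all(
        sum(aij * xj for aij, xj in zip(row, x)) - bi <= tol
        for row, bi in zip(A, b)
    )


def on_boundary(x, A, b, tol=1e-12):
    """Return True if x is in the polygon and at least one constraint is tight."""
    tight = any(
        abs(sum(aij * xj for aij, xj in zip(row, x)) - bi) <= tol
        for row, bi in zip(A, b)
    )
    return in_polygon(x, A, b, tol) and tight


# Hypothetical example: the unit square 0 <= x1 <= 1, 0 <= x2 <= 1.
A = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
b = [1.0, 0.0, 1.0, 0.0]

u, v = (0.0, 0.5), (1.0, 0.5)  # two points on the boundary
mid = tuple(0.5 * ui + 0.5 * vi for ui, vi in zip(u, v))

print(in_polygon(mid, A, b))   # the midpoint stays in the closed polygon
print(on_boundary(mid, A, b))  # but it is not in the boundary-only polygon
```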


Convex function We are now ready to define the convex function. A real-valued
function $f(x)$ is said to be convex if the following two conditions are satisfied:


(i) the domain of the function $f$, denoted by $\mathrm{dom} f$ (the set in which the input
$x$ of the function lies), is a convex set; and


(ii) for all $x, y \in \mathrm{dom} f$ and $\lambda \in [0, 1]$:


$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y). \tag{1.14}$$


Here you can see why we needed to know about the concept of the convex set. The
concept appears in the first condition that $\mathrm{dom} f$ should satisfy.
Actually this “convex set” condition is required; otherwise, a problem occurs when
it comes to stating the second key condition in (1.14), because the function on the
left-hand side cannot be defined. Notice that the input argument of the function is
$\lambda x + (1 - \lambda) y$ and this should be in $\mathrm{dom} f$ (meaning that $\mathrm{dom} f$ should be convex);
otherwise, $f(\lambda x + (1 - \lambda) y)$ cannot be defined.

If you think about a picture that reflects the second key condition (1.14),
then you can readily get a feel for why the convex function should be defined in
such a manner. The meaning of “convex” is “bowl-shaped”. So we can think about
a bowl-shaped curve like the one illustrated in Fig. 1.8. Consider two points,
say $x$ and $y$, and a $\lambda$-weighted convex combination, $\lambda x + (1 - \lambda) y$. The function
evaluated at $\lambda x + (1 - \lambda) y$ is on the bowl-shaped curve while the same-weighted
convex combination $\lambda f(x) + (1 - \lambda) f(y)$ of the two function values evaluated at $x$ and
$y$ is above $f(\lambda x + (1 - \lambda) y)$. Hence, the key condition (1.14) comes naturally as a
consequence of the bowl-shaped feature of the curve.


Figure 1.8. A geometric intuition behind convex functions.

There are many examples of convex functions, and many of these are
ones that you should be familiar with if you wish to become an expert in this field, or at
least to recognize those to which the problems of your interest
can be linked. But exploring many such examples may be too much for now, as this is
just the beginning of the book. Thus we
will here investigate only a couple of examples. One example of a convex function is:


$$f(x) = \frac{1}{x}, \quad x > 0. \tag{1.15}$$


Here the function is indeed bowl-shaped, so it respects the second condition (1.14).
Also $\mathrm{dom} f = \{x : x > 0\}$ is a convex set, satisfying the first condition. Hence, the
function is convex.

Now what about a slightly different function:


$$f(x) = \frac{1}{x}. \tag{1.16}$$


Here the distinction is that $\mathrm{dom} f$ is not explicitly defined. In this case, we should
think about an implicit constraint that $x$ should satisfy. The implicit constraint is:
$x \neq 0$, thus yielding:


$$\mathrm{dom} f = (-\infty, 0) \cup (0, \infty).$$


Since $\mathrm{dom} f$ is not convex, the function is not convex either.
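
Both observations can be probed numerically. The following sketch, under the assumption that random sampling on the interval [0.01, 10] is representative enough for illustration, looks for violations of (1.14) for $f(x) = 1/x$ on $x > 0$, and then shows that the midpoint of $-1$ and $1$ falls outside the implicit domain $x \neq 0$.

```python
import random


def f(x):
    """The function f(x) = 1/x from (1.15) and (1.16)."""
    return 1.0 / x


def violates_convexity(x, y, lam, tol=1e-9):
    """True if f(lam*x + (1-lam)*y) exceeds lam*f(x) + (1-lam)*f(y) beyond tol."""
    z = lam * x + (1.0 - lam) * y
    return f(z) > lam * f(x) + (1.0 - lam) * f(y) + tol


random.seed(0)  # fixed seed so that the run is reproducible
violations = 0
for _ in range(10000):
    x = random.uniform(0.01, 10.0)
    y = random.uniform(0.01, 10.0)
    lam = random.random()
    if violates_convexity(x, y, lam):
        violations += 1
print(violations)

# With dom f = (-inf, 0) U (0, inf), the midpoint of -1 and 1 is 0,
# which lies outside the domain, so the domain is not a convex set.
midpoint = 0.5 * (-1.0) + 0.5 * 1.0
midpoint_in_domain = midpoint != 0.0
print(midpoint_in_domain)
```

Finding no violation among random samples is consistent with convexity on $x > 0$ but is not a proof.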

There is a way to handle this issue to make such a function convex.
The way is to make $\mathrm{dom} f$ span the entire real line (making it convex) while setting
the function to suitable values on the newly added region. For instance, we
can define the function as:


$$
f(x) =
\begin{cases}
\frac{1}{x}, & x > 0; \\
+\infty, & x \le 0.
\end{cases}
\tag{1.17}
$$


Notice that now $\mathrm{dom} f = (-\infty, \infty)$ is a convex set. Also one can readily verify that
the key condition (1.14) is satisfied. Hence, the function is convex.

There is another class of functions defined very similarly to convex functions:
concave functions. We say that $f(x)$ is concave if $-f(x)$ is convex. The
geometric intuition says that a function shaped like an upside-down bowl is concave. Also a function
is said to be affine (linear plus bias) if it is both convex and concave.

On a side note: Here you may wonder if there is a corresponding set with respect
to (w.r.t.) concave functions, like a concave set. Unlike a convex set, people do
not introduce any definition for a set regarding concave functions. Remember in
the definition of convex functions that the convex set was employed only for the
purpose of making $f(\lambda x + (1 - \lambda) y)$ definable at the convex combination. Hence,
the same convex set suffices to define concave functions.


Convex sets defined in terms of general convex functions Previously we
investigated a bunch of examples of convex sets where only affine functions, $Ax - b$
(linear and bias-allowing functions), are introduced. There are many convex sets
concerning general convex functions. Here we list a couple of such examples.

One such example is:


$$S = \{x : f(x) \le 0\} \tag{1.18}$$


where $f(x)$ is a convex function. Here is the proof that $S$ is a convex set. Suppose
$x, y \in S$. Then, $f(x) \le 0$ and $f(y) \le 0$. This together with the convexity of $f$,
reflected in the condition (1.14), gives, for any $\lambda \in [0, 1]$:


$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y) \le 0,$$


which in turn implies that $\lambda x + (1 - \lambda) y \in S$. This completes the proof.
Another example is the intersection of such convex sets:


$$S = S_1 \cap S_2, \qquad S_1 = \{x : f_1(x) \le 0\}, \quad S_2 = \{x : f_2(x) \le 0\}. \tag{1.19}$$


Try the proof in Prob 1.4. Actually the intersection of arbitrary convex sets is also
convex; check this in Prob 1.4 as well.


Convex optimization problem in standard form We are now ready to define
the convex optimization problem. It is an optimization problem which satisfies
the following three conditions: (i) the objective function is convex; (ii) the set induced by
the inequality constraints is convex; and (iii) the set induced by the equality constraints
is convex. So the standard form of the convex optimization problem reads (1.7) in
which (i) $f(x)$ is convex; (ii) the $f_i(x)$'s are convex; and (iii) the $h_i(x)$'s are affine. Notice that the
set induced by affine equality constraints $S = \{x : Ax - b = 0\}$ is a convex set as
we studied earlier.

On a side note: Notice that the standard form exhibits just a sufficient condition 
under which the set induced by inequality constraints is convex. There are indeed 
a bunch of convex sets which take the form like (1.18) yet having a non-convex 
function f (x). But it turns out that in many cases, convex sets can be represented 
as the form like (1.18) with a convex function f (x). This is one of the main reasons 
that people define convex optimization in such a manner. 


Look ahead There is another reason that the convex optimization problem is
defined in such a manner. This is because the way of definition makes the problem
tractable. In the next section, we will provide an intuition as to why convex
optimization is tractable. We will then start investigating one instance of convex
optimization: Linear Program (LP).


1.3 Tractability of Convex Optimization and Gradient Descent


Recap In the last section, we studied the concepts of convex sets and convex functions
to understand what convex optimization is. We then figured out that convex
optimization is defined as:


$$
\begin{aligned}
\min f(x): \quad & f_i(x) \le 0, \quad i \in \{1, \dots, m\}, \\
& h_i(x) = 0, \quad i \in \{1, \dots, p\}
\end{aligned}
\tag{1.20}
$$


where $f(x)$ and the $f_i(x)$'s are convex and the $h_i(x)$'s are affine functions. There is a reason
that many people have been interested in convex optimization defined
particularly as above. The reason is that this way of defining the problem makes the
problem tractable. Here what we mean by tractable is that the optimal solution can
be obtained via an algorithm (often with the help of a computer) even if the closed
form solution is unknown.


Outline The main goal of this section is to understand why such a particular way
of definition enables tractability. We will try to understand this by focusing on
two cases: (i) unconstrained minimization; and (ii) constrained minimization. For 
the unconstrained case, we will present an explicit intuition as to why the convex 
objective function yields a tractable way of solving the problem. Specifically we will 
first derive a simple-looking necessary and sufficient condition for the optimality 
of a solution and then argue that there are efficient algorithms that allow us to 
satisfy the condition, thus obtaining the optimal solution. We will also study one 
very prominent algorithm, named gradient descent. For the constrained case, on 
the other hand, we will rely upon a well-known theory (to be studied in depth in 
Part II) to argue that there are also efficient algorithms that allow us to solve the 
problem. 

Another goal of this section is to give an overview to the contents regarding one 
instance of convex optimization problems: Linear Program (LP). This is what we 
will cover through a couple of upcoming sections. 


Assumption While investigating the two cases, we will assume that: (i) the objective
function $f(x)$ is differentiable at every point $x$ in $\mathrm{dom} f$; (ii) $\mathrm{dom} f$ is open; and
(iii) $f(x)$ has a stationary point, i.e., there exists $x^*$ such that $\nabla f(x^*) = 0$. The
rationale behind this assumption is twofold. First, the assumption
represents many practically-relevant scenarios. Second, the assumption allows us to
easily explain an intuition behind the tractability of convex optimization. This will
be clearer as this section progresses.


Figure 1.9. 1st order condition of convex functions: $f(x) \ge \nabla f(x^*)^T (x - x^*) + f(x^*)$, $\forall x \in \mathrm{dom} f$.


Unconstrained minimization Let us start by investigating the unconstrained
convex optimization:


$$\min f(x).$$


Recall the meaning of convex: “bowl-shaped”. So one can think of a graph like the one
illustrated in Fig. 1.9. This graph also respects our assumption that there exists a
stationary point. Here you can easily see that the slope of the objective function
at the optimal point $x^*$ is 0, and also vice versa (meaning that a point with
slope 0 is an optimal solution). This naturally guides us to conjecture
that $\nabla f(x^*) = 0$ is a sufficient and necessary condition for $x^*$ to be
optimal:


$$\nabla f(x^*) = 0 \iff f(x) \ge f(x^*) \;\; \forall x \in \mathrm{dom} f. \tag{1.21}$$


It turns out this conjecture indeed holds. Here is the proof.

Proof of the direct part ($\Longrightarrow$) in (1.21): To gain some insights, let us see a convex
function $f(x)$ in Fig. 1.9. Pick a point $(x^*, f(x^*))$. Now consider a line that
passes through the point $(x^*, f(x^*))$ with slope $\nabla f(x^*)$ so that it is tangent to
$f(x)$. Then, the line reads: $\nabla f(x^*)^T (x - x^*) + f(x^*)$. Here the picture
suggests that the convex function $f(x)$ is above (or touching) the line:

$$f(x) \ge \nabla f(x^*)^T (x - x^*) + f(x^*) \quad \forall x \in \mathrm{dom} f. \tag{1.22}$$


It turns out this is indeed the case, meaning that the condition (1.22) (together with
$\mathrm{dom} f$ being convex) holds for any $x, x^* \in \mathrm{dom} f$ if and only if $f(x)$ is convex. This
is one of the crucial properties of convex functions, called the “1st order condition
of convex functions”. The proof of this is omitted here, but you will have a chance
to prove this in Prob 1.5. This together with the hypothesis ($\nabla f(x^*) = 0$) gives:
$f(x) \ge f(x^*)$, $\forall x \in \mathrm{dom} f$.

Proof of the converse part ($\Longleftarrow$) in (1.21): The converse proof relies upon the
following fact:


$$f(x) \ge f(x^*) \;\; \forall x \in \mathrm{dom} f \implies \nabla f(x^*)^T (x - x^*) \ge 0 \;\; \forall x \in \mathrm{dom} f. \tag{1.23}$$


Let us adopt this for the time being. We will prove this fact once we finalize the
converse proof.

Suppose $\nabla f(x^*) \neq 0$. Here the key thing to note is that there is no constraint
on $x$, except that $x \in \mathrm{dom} f$. Since $\mathrm{dom} f$ is open, one can choose $x$ such that
$x - x^*$ points in an arbitrary direction; for instance, $x = x^* - \epsilon \nabla f(x^*)$ with a
sufficiently small $\epsilon > 0$. This implies that we can easily choose $x$ such that


$$\nabla f(x^*)^T (x - x^*) < 0. \tag{1.24}$$


This contradicts the RHS of (1.23). Hence, it must be that $\nabla f(x^*) = 0$. This
completes the proof.

Let us now prove (1.23), which we deferred earlier.

Proof of (1.23): The proof is by contradiction. Suppose that there exists $x \in \mathrm{dom} f$
(i.e., $\exists x$) such that:


$$\nabla f(x^*)^T (x - x^*) < 0. \tag{1.25}$$


Consider a point: $z(\lambda) := \lambda x + (1 - \lambda) x^*$ where $\lambda \in [0, 1]$. Notice that
$z(\lambda) \in \mathrm{dom} f$, as the function $f$ is convex and therefore its domain is a convex set. Here
what we want to show is that for a very small $\lambda > 0$, $f(z(\lambda)) < f(x^*)$. This is
because $f(z(\lambda)) < f(x^*)$ contradicts the fact that $x^*$ is an optimal solution.
To show this, we consider the following quantity:


$$
\frac{d}{d\lambda} f(z(\lambda))
\overset{(a)}{=} \nabla f(z(\lambda))^T \frac{d z(\lambda)}{d\lambda}
\overset{(b)}{=} \nabla f(z(\lambda))^T (x - x^*)
$$


where (a) follows from the chain rule and (b) is due to the definition
$z(\lambda) := \lambda x + (1 - \lambda) x^*$. Now evaluating both sides at $\lambda = 0$, we get:


$$\left. \frac{d}{d\lambda} f(z(\lambda)) \right|_{\lambda = 0} = \nabla f(x^*)^T (x - x^*) < 0 \tag{1.26}$$


where the last inequality comes from our assumption (1.25). Here the derivative of
$f(z(\lambda))$ being negative at $\lambda = 0$ implies that $f(z(\lambda))$ decreases as $\lambda$ increases
from 0, and therefore, for a sufficiently small $\lambda > 0$:

$$f(z(\lambda)) < f(z(0)) = f(x^*). \tag{1.27}$$

This contradicts the hypothesis $f(x) \ge f(x^*)$ $\forall x \in \mathrm{dom} f$. This completes
the proof.
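
Two facts used in the proofs above can be sanity-checked numerically: the first-order condition (1.22) and the decrease (1.27) along the segment $z(\lambda)$. The sketch below uses a hypothetical convex function $f(x) = x_1^2 + 2 x_2^2$ and a deliberately non-stationary point `x0` in the role of $x^*$; these choices, the grid and the value of `lam` are assumptions for illustration only.

```python
def f(x):
    """Hypothetical convex objective f(x) = x1^2 + 2 * x2^2."""
    return x[0] ** 2 + 2.0 * x[1] ** 2


def grad_f(x):
    """Gradient of the hypothetical objective."""
    return (2.0 * x[0], 4.0 * x[1])


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def tangent_value(x, x0):
    """Right-hand side of (1.22), with x0 playing the role of x*."""
    diff = tuple(xi - x0i for xi, x0i in zip(x, x0))
    return dot(grad_f(x0), diff) + f(x0)


x0 = (1.0, 1.0)  # not a stationary point: grad_f(x0) is nonzero
x = (0.0, 0.0)   # chosen so that grad_f(x0)^T (x - x0) < 0

# First-order condition: f lies on or above its tangent plane at x0.
grid = [(-2.0 + 0.5 * i, -2.0 + 0.5 * j) for i in range(9) for j in range(9)]
first_order_holds = all(f(p) >= tangent_value(p, x0) - 1e-12 for p in grid)

# Derivative of f(z(lam)) at lam = 0, as in (1.26).
slope_at_zero = dot(grad_f(x0), tuple(xi - x0i for xi, x0i in zip(x, x0)))

# A small step along z(lam) = lam * x + (1 - lam) * x0 decreases f, as in (1.27).
lam = 1e-3
z = tuple(lam * xi + (1.0 - lam) * x0i for xi, x0i in zip(x, x0))
decreases = f(z) < f(x0)

print(first_order_holds, slope_at_zero, decreases)
```

Since `x0` is not stationary, the decrease confirms that it cannot be optimal, which is exactly the mechanism behind the contradiction argument.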


Gradient descent So what we can conclude with respect to (w.r.t.) unconstrained
minimization is that:


$\nabla f(x^*) = 0$ is a sufficient and necessary condition for $x^*$ to be optimal.


This suggests that it suffices to find a point such that (s.t.) its gradient is 0. But
there are two issues in obtaining such a point. One is that computing
$\nabla f(x)$ may not be that simple. The second is that analytically finding such a point
may not be doable even if one can explicitly compute the gradient. However, there
is good news: there are several algorithms which allow us to
find such a point numerically without the knowledge of the closed form solution.
One prominent algorithm that has been widely employed in a variety of fields is
gradient descent.

Here is how the algorithm works. Gradient descent is an iterative algorithm.
Suppose that at the $t$-th iteration, we have an estimate of $x^*$, say $x^{(t)}$. We then
compute the gradient of the function evaluated at the estimate: $\nabla f(x^{(t)})$. Next we
update the estimate along the direction opposite to that of the gradient:


$$x^{(t+1)} \leftarrow x^{(t)} - \alpha^{(t)} \nabla f(x^{(t)}) \tag{1.28}$$


where $\alpha^{(t)} > 0$ indicates the learning rate (also called the step size) that usually decays
like $\alpha^{(t)} = \frac{1}{2^t}$.

If you think about it, this update rule makes intuitive sense. Suppose $x^{(t)}$
is placed to the right of the optimal point $x^*$, as in the two-dimensional case²
illustrated in Fig. 1.10.

Then, we should move $x^{(t)}$ to the left so that it becomes closer to $x^*$. The update
rule actually does this, as we subtract $\alpha^{(t)} \nabla f(x^{(t)})$. Notice that $\nabla f(x^{(t)})$ is positive
(it points to the right) given that $x^{(t)}$ is placed to the right of $x^*$. We repeat this
procedure until it converges. It turns out that as $t \to \infty$, it converges:


$$x^{(t)} \to x^*, \tag{1.29}$$


as long as the learning rate is chosen properly, like the one decaying exponentially.
We will not touch upon the proof of this convergence. In fact, the proof is difficult;
there is even a big field in statistics which intends to prove the convergence of a
variety of algorithms (if the convergence holds).
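
The update rule (1.28) translates almost line by line into code. The following is a minimal sketch for a scalar variable, assuming a hypothetical objective $f(x) = (x - 3)^2$ and, for simplicity, a constant step size of 0.1 rather than a decaying one; the function name `gradient_descent` and its parameters are illustrative choices, not an established API.

```python
def gradient_descent(grad, x0, step_size, num_iters):
    """Iterate x <- x - step_size(t) * grad(x), as in (1.28), for scalar x."""
    x = x0
    history = [x]
    for t in range(num_iters):
        x = x - step_size(t) * grad(x)
        history.append(x)
    return x, history


def grad(x):
    """Gradient of the hypothetical objective f(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)


# Assumption: a constant step size of 0.1, chosen only for simplicity.
x_final, history = gradient_descent(
    grad, x0=0.0, step_size=lambda t: 0.1, num_iters=100
)
print(x_final)
```

Passing the step size as a function of `t` makes it easy to swap in a decaying schedule and compare how the iterates in `history` behave.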


2. In a higher-dimensional case, it is difficult to visualize how $x^{(t)}$ is placed. Hence, we focus on the
two-dimensional case. It turns out that gradient descent works even for high-dimensional settings although
it is not 100% intuitive.


[Figure 1.10 labels: $x^*$; $x^{(t)}$, the $t$-th estimate; slope $\nabla f(x^{(t)})$; update
$x^{(t+1)} \leftarrow x^{(t)} - \alpha^{(t)} \nabla f(x^{(t)})$.]

Figure 1.10. How gradient descent works.


Constrained minimization As mentioned in the beginning of this section, to 
give insights in the constrained minimization, we will rely upon a well-known the- 
ory that we will touch upon in depth later in Part II. The theory is called strong 
duality. There is a rigorous mathematical statement of strong duality. But here we 
will state only an implication of strong duality because in that way we can even 
understand a rationale without introducing any further notations and concepts. 
The implication of strong duality is: 


A convex-constrained problem can be translated into an unconstrained
convex optimization without loss of optimality.


This suggests that it suffices to consider a translated unconstrained minimization; 
hence, we can rely on efficient algorithms like gradient descent to solve the prob- 
lem. In summary, we have efficient algorithms for solving both unconstrained and 
constrained convex optimization. This is exactly the reason why we define convex 
optimization in the particular manner mentioned at the beginning. 

You may be eager to learn about strong duality right now because we relied
heavily on the theory for the purpose of explaining the tractability of convex
optimization. However, we will not touch upon it now because the proof requires
considerable background. If we did so now, you could easily be distracted, potentially losing
interest in the optimization field. Please be patient until we get to that point in
Part II.


Overview of Linear Program (LP) From now on, we will start investigating
several instances of convex optimization problems. The first instance that we will take
is: Linear Program (LP).


$$
\begin{aligned}
\min f(x): \quad & f_i(x) \le 0, \quad i \in \{1, \dots, m\}, \\
& h_i(x) = 0, \quad i \in \{1, \dots, p\}.
\end{aligned}
\tag{1.30}
$$


Here we say that (1.30) is an LP if all functions $f(x)$, $f_i(x)$'s, $h_i(x)$'s are affine.

Since Kantorovich’s breakthrough, people realized that many interesting/important
problems can be formulated as LPs such as: (i) resource allocation problems
(like the military-related problem that Kantorovich considered); (ii) transportation
problems (Monge, 1781) (important problems that arise in economics and other
relevant areas); (iii) the classification problem (one of the most classical and popular
problems in machine learning); (iv) the network flow problem (a fundamental
problem in computer networks); and so on and so forth.

Moreover, some very difficult problems in which the optimization variable is 
boolean (binary values) can be approximated as an LP via a relaxation technique. 
Very interestingly, in some cases, the LP relaxation yields the exact solution to the 
original problem, meaning it comes without loss of optimality. 


Look ahead Through a couple of upcoming sections, we will deal with the above
examples together with algorithms and software implementation. Specifically we are
going to cover four topics. First we will study a few examples that can be formulated
as LPs. Second, we will study the LP relaxation technique which we claimed useful
for some very difficult problems. Third, we will investigate efficient algorithms for
solving LPs. Lastly we will study how to implement such algorithms using software
tools like CVXPY running in Python.


Problem Set 1


Prob 1.1 (Least Squares) Let $A := [a_1 \cdots a_m]^T$ and $b := [b_1 \cdots b_m]^T$
where $a_i \in \mathbb{R}^d$ and $b_i \in \mathbb{R}$, $i \in \{1, \dots, m\}$.


(a) Consider a function $f : \mathbb{R}^d \to \mathbb{R}$: $f(x) = \|a_1^T x - b_1\| + \cdots + \|a_m^T x - b_m\|$
where $x \in \mathbb{R}^d$ and $\| \cdot \|$ denotes the Euclidean norm. A student claims that
the function $f$ can be represented as:


$$f(x) = \|Ax - b\|. \tag{1.31}$$


Prove or disprove the claim.
(b) Consider another function $f : \mathbb{R}^d \to \mathbb{R}$:


$$f(x) = \|Ax - b\|^2. \tag{1.32}$$


Using only the definition of a convex function, show that $f(x)$ is convex in
$x \in \mathbb{R}^d$.


Prob 1.2 (Basics on gradients) Let $f : \mathbb{R}^d \to \mathbb{R}$ be a differentiable function.
Let $x = [x_1, x_2, \dots, x_d]^T \in \mathbb{R}^d$, $B \in \mathbb{R}^{d \times d}$ and $c \in \mathbb{R}^d$.


(a) Explain the concept of the gradient $\nabla f$ with respect to (w.r.t.) $x$ (also
denoted by $\nabla_x f$).

(b) Suppose $f(x) = x^T x$. Derive $\nabla f$ w.r.t. $x$. Show all of your detailed
derivation.

(c) Suppose $f(x) = x^T B x$. Derive $\nabla f$ w.r.t. $x$. Show all of your detailed
derivation.

(d) Suppose $f(x) = c^T x$. Derive $\nabla f$ w.r.t. $x$. Show all of your detailed
derivation.

(e) Suppose $f(x) = x^T c$. Derive $\nabla f$ w.r.t. $x$. Show all of your detailed
derivation.


Prob 1.3 (Representation of convex sets)


(a) In Section 1.2, it was claimed that a plane in a 3-dimensional space can be
represented as


$$S = \{x : a^T x - b = 0\} \tag{1.33}$$


where $x, a \in \mathbb{R}^3$ and $b \in \mathbb{R}$. Explain why.

(b) State the definition of a hyperplane. What is the representation of
a set that indicates a hyperplane in a $d$-dimensional ambient space?
Specify the dimension of matrices and vectors (if any) employed in your
representation.

(c) State the definition of a polytope. What is the representation of a set that
indicates a polytope in a $d$-dimensional ambient space? Specify the dimension
of matrices and vectors (if any) employed in your representation.


Prob 1.4 (Convex sets) Let $f_1 : \mathbb{R}^d \to \mathbb{R}$ and $f_2 : \mathbb{R}^d \to \mathbb{R}$ be convex
functions.


(a) In Section 1.2, we proved that $S_i = \{x : f_i(x) \le 0\}$ is convex for $i \in \{1, 2\}$.
Show that $S = S_1 \cap S_2$ is convex.

(b) Suppose $S_1$ and $S_2$ are arbitrary convex sets. Prove that $S = S_1 \cap S_2$ is
convex.

(c) Suppose $f : \mathbb{R} \to \mathbb{R}$ is a concave function. A student claims that
$C = \{x : f(x) \le 0\}$ is not convex. Prove or disprove the claim.


Prob 1.5 (1st-order condition of convexity) Suppose $f : \mathbb{R}^d \to \mathbb{R}$ is
differentiable, i.e., its gradient $\nabla f$ exists at each point in $\mathrm{dom} f$. In Section 1.3, it was
claimed that $f$ is convex if and only if


$$
\begin{aligned}
& \mathrm{dom} f \text{ is convex}; \\
& f(x) \ge \nabla f(x^*)^T (x - x^*) + f(x^*) \quad \forall x, x^* \in \mathrm{dom} f.
\end{aligned}
\tag{1.34}
$$


This problem explores the proof of the above.


(a) Suppose $d = 1$. Show that if $f(x)$ is convex, then (1.34) holds.
(b) Suppose $d = 1$. Show that if (1.34) holds, then $f(x)$ is convex.
(c) Prove the claim for arbitrary $d \in \mathbb{N}$.


Prob 1.6 (Composition of convex functions) Let $f : \mathbb{R}^d \to \mathbb{R}$ be a
real-valued function. The same holds for $g$, $f_1$ and $f_2$.


(a) Show that if the functions $f(x)$ and $g(x)$ are convex, so is $f(x) + g(x)$.

(b) Let $A$ and $b$ be a matrix and a vector of compatible size. Show that if $f(x)$
is convex, so is $f(Ax + b)$.

(c) Show that if $f_1$ and $f_2$ are convex, so is $\max\{f_1(x), f_2(x)\}$.

(d) Show that if $f(x)$ is convex, then $-\log(-f(x))$ is convex on $\{x : f(x) < 0\}$.


Prob 1.7 (Jensen’s inequality) Suppose that $f : \mathbb{R}^d \to \mathbb{R}$ is convex,
$x_1, \dots, x_k \in \mathrm{dom} f$, and $\lambda_1, \dots, \lambda_k \ge 0$ with $\lambda_1 + \cdots + \lambda_k = 1$. Show that


$$f(\lambda_1 x_1 + \cdots + \lambda_k x_k) \le \lambda_1 f(x_1) + \cdots + \lambda_k f(x_k). \tag{1.35}$$


Also, identify the conditions under which the equality holds.


Prob 1.8 (Logistic function) Let $x, w \in \mathbb{R}^d$. Let


$$\sigma(t) := \frac{1}{1 + e^{-t}}, \quad t \in \mathbb{R}. \tag{1.36}$$


(a) A student claims that $L(w) := -\log \sigma(w^T x)$ is convex in $w$. Prove or
disprove the claim.

(b) A student claims that $L(w) := \log(1 - \sigma(w^T x))$ is convex in $w$. Prove or
disprove the claim.


Prob 1.9 (Convex function vs convex set) Let $f : \mathbb{R}^d \to \mathbb{R}$ be a real-valued
function. In Section 1.2, we proved that if $f(x)$ is convex, $S = \{x : f(x) \le 0\}$ is a
convex set. A student claims that if $S$ is a convex set, $f(x)$ must be convex. Prove
or disprove the claim.


Prob 1.10 (Gradient descent)


(a) Consider an optimization problem:

$$\min x^2 - 2x \tag{1.37}$$

where $x \in \mathbb{R}$. Suppose we perform gradient descent with an initial point
$x^{(0)} = 5$ and the learning rate $\alpha^{(t)} = \frac{1}{2^t}$. Use Python to plot the objective
function evaluated at $x^{(t)}$ as a function of $t$ where $x^{(t)}$ denotes the estimate
at the $t$-th iteration. What is the limiting value of $x^{(t)}$? Does it converge to
the optimal solution that you can obtain analytically?

(b) Now consider:

$$\max -x^2 + 4x \tag{1.38}$$

where $x \in \mathbb{R}$. Can you apply gradient descent to approach the optimal
solution numerically (without converting max into min optimization)? If
so, redo part (a) with the same initial point and the learning rate. Otherwise,
come up with another iterative algorithm which bears similarity to gradient
descent, yet which allows us to obtain the optimal solution. Also explain
why. In addition, plot the objective function evaluated at $x^{(t)}$ as a function
of $t$ where $x^{(t)}$ denotes the $t$-th estimate of your algorithm.


Prob 1.11 (True or False?)


(a) A line is represented as:

$$S = \{x : a^T x - b = 0\} \tag{1.39}$$

where $a, x \in \mathbb{R}^d$ and $b \in \mathbb{R}$.
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(b) Consider a set {(x1, x2) € R? : & > 1,x, > 0,x2 > 0} where x1,x2 € R. 
. x2 
The set is convex. 
(c) Consider a set $\{(x, t) \in \mathbb{R}^{d+1} : x^T x \le t\}$ where $x \in \mathbb{R}^d$, $t \in \mathbb{R}$ and $d \in \mathbb{N}$.
The set is convex.
(d) Suppose that $f_1, f_2, f_3, f_4 : \mathbb{R}^d \to \mathbb{R}$ are concave functions. Then,


$$\max\{\min\{f_1(x), f_2(x)\}, \min\{f_3(x), f_4(x)\}\} \tag{1.40}$$


is convex.
(e) Consider functions $f_i(x)$, $i \in \{1, 2, \dots, m\}$. Then, the following function


$$\max_{x \in \mathrm{dom} f} \sum_{i=1}^{m} \lambda_i f_i(x) \tag{1.41}$$


is always convex in $\lambda := [\lambda_1, \dots, \lambda_m]^T \in \mathbb{R}^m$.


1.4 Linear Program (LP)


Recap In Section 1.3, we tried to understand why convex optimization stated
below is tractable, i.e., it can be solved efficiently via an algorithm even if the closed
form solution is unknown:


$$
\begin{aligned}
\min f(x): \quad & f_i(x) \le 0, \quad i \in \{1, \dots, m\}, \\
& h_i(x) = 0, \quad i \in \{1, \dots, p\}
\end{aligned}
\tag{1.42}
$$


where $f(x)$ and the $f_i(x)$'s are convex and the $h_i(x)$'s are affine functions. To this end, we
considered two cases: (i) unconstrained case; and (ii) constrained case. For
unconstrained minimization, we first derived the necessary and sufficient condition for
$x^*$ to be optimal: $\nabla f(x^*) = 0$, assuming that there exists a stationary point and
that $f$ is differentiable at every point in $\mathrm{dom} f$, which is open. We then
investigated one prominent algorithm for finding such a solution: gradient descent.
Gradient descent is an iterative algorithm which allows us to obtain $x^*$
numerically. For constrained minimization, on the other hand, we relied upon a
theory called strong duality, which enables translating a convex-constrained
problem into an unconstrained convex problem without loss of optimality. So with
this theory, we can use algorithms like gradient descent to address the
convex-constrained problem. This is how we understood why convex optimization is
tractable.

We then moved on to one simple instance of convex optimization: Linear Program
(LP). The standard form reads:


$$
\begin{aligned}
\min w^T x: \quad & Ax - b \le 0, \\
& Cx - e = 0
\end{aligned}
\tag{1.43}
$$


where $w$, $A$, $b$, $C$ and $e$ are of compatible size and the inequality is component-wise.
At the end, we claimed that many interesting and important problems can be
translated into LPs.
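
To get a feel for what solving such a problem means, here is a small sketch that solves a hypothetical two-variable LP with inequality constraints only. It relies on the standard fact that, when the feasible region is a bounded polygon, an optimum is attained at one of its vertices, so it simply enumerates intersections of constraint pairs. The data `w`, `A`, `b` and the helper name `solve_2d_lp` are assumptions for illustration; this brute-force approach is not one of the efficient algorithms discussed later.

```python
from itertools import combinations


def solve_2d_lp(w, A, b, tol=1e-9):
    """Minimize w^T x subject to A x - b <= 0 for x in R^2 by vertex enumeration."""
    best_x, best_val = None, float("inf")
    for (r1, b1), (r2, b2) in combinations(zip(A, b), 2):
        det = r1[0] * r2[1] - r1[1] * r2[0]
        if abs(det) < tol:
            continue  # parallel boundary lines do not intersect in a single point
        # Intersection of the two boundary lines via Cramer's rule.
        x = ((b1 * r2[1] - r1[1] * b2) / det, (r1[0] * b2 - b1 * r2[0]) / det)
        feasible = all(
            row[0] * x[0] + row[1] * x[1] - bi <= tol for row, bi in zip(A, b)
        )
        if feasible:
            val = w[0] * x[0] + w[1] * x[1]
            if val < best_val:
                best_x, best_val = x, val
    return best_x, best_val


# Hypothetical LP: minimize -x1 - x2 subject to
# x1 + 2*x2 <= 4, 3*x1 + x2 <= 6, x1 >= 0, x2 >= 0.
w = (-1.0, -1.0)
A = [(1.0, 2.0), (3.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
b = [4.0, 6.0, 0.0, 0.0]

x_opt, val_opt = solve_2d_lp(w, A, b)
print(x_opt, val_opt)
```

The number of candidate vertices grows quickly with the number of constraints and variables, which is one reason dedicated LP algorithms are needed in practice.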


Outline The goal of this section is to defend the claim. To this end, we will study
two prominent and historical examples which can be formulated as LPs. The first
is a resource allocation problem which is also called an optimal planning problem.
This has been the most important problem in economics and operations research
in the 20th century. Specifically we will investigate a historical problem (explored
by the father of LP, Leonid Kantorovich (Gardner, 1990)), which later inspired
the development of LP. The second is a transportation problem which
has been playing a crucial role in a variety of fields for centuries. In particular we
will study a problem explored by the father of Transportation Theory, Gaspard
Monge, a French mathematician in the 18th century (Monge, 1781). While
investigating these examples, we will learn a couple of techniques that serve to formulate
an LP: (i) how to express conditions (given in a problem) in terms of vector and
matrix notations; (ii) how to set up a proper optimization variable that yields an
LP formulation; and (iii) how to translate a convex function into an affine function
that arises in the standard form of LP.


            | wood 1          | wood 2
  machine 1 | 10 units/time   | 10 units/time
  machine 2 | 20 units/time   | 40 units/time

Figure 1.11. Machine capabilities of peeling woods.


Kantorovich’s plywood cutting problem (Gardner, 1990) One of the 
problems that Kantorovich considered is the plywood cutting problem. Kantorovich 
encountered the problem in 1937 while interacting with plywood engineers. For 
simplicity, we consider a much simpler version of the original problem. 

The problem is about allocating the time for the use of different machines for 
peeling different kinds of woods. Suppose there are two kinds of woods to peel, 
say wood 1 and wood 2. Also there are two different peeling machines: machine 
1 and machine 2. Each machine has a different capability of peeling. Machine 
1 can peel 10 units/time for either type of wood. On the other hand, machine 
2 can peel 20 units/time for wood 1 while peeling 40 units/time for wood 2. 
See Fig. 1.11. 

The goal of the problem is to maximize the total wood production. But there is 
a constraint. The constraint is that production is desired to meet the equal propor- 
tion, i.e., the amount of wood 1 peeled is desired to be the same as that of wood 
2. If there is a remnant part which exceeds the equal proportion, then it is simply 
discarded. So the objective is to maximize the minimum of wood 1 and 2 products: 


max min {wood 1 product, wood 2 product} . (1.44) 


What is an optimization variable? In other words, what is a quantity that affects
the objective function? It is the time that we use for peeling each wood with
a certain machine. Let $x_1$ be machine 1’s time for peeling wood 1. Normalizing
the time, we can assume that $0 \le x_1 \le 1$. Assuming that machines are always
operating, machine 1’s time for wood 2 would be $1 - x_1$. Similarly define
$0 \le x_2 \le 1$ as machine 2’s time for peeling wood 1. Using these notations together with
machine capabilities illustrated in Fig. 1.11, we get: wood 1 product $= 10x_1 + 20x_2$
(units); wood 2 product $= 10(1 - x_1) + 40(1 - x_2)$ (units). Now applying this
to (1.44) and flipping the sign of the objective function, we obtain a minimization
problem as follows:

$$\min\; \max\{-10x_1 - 20x_2,\; 10(x_1 - 1) + 40(x_2 - 1)\} : \quad 0 \le x_1 \le 1, \; 0 \le x_2 \le 1. \tag{1.45}$$
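As a quick sanity check on (1.44), the following Python sketch evaluates the max-min objective on a coarse grid of time allocations. The grid spacing of 1/20 and the function names are illustrative assumptions, not part of the original problem, and a grid search only approximates the optimum in general.

```python
from fractions import Fraction

def wood_products(x1, x2):
    """Return (wood 1 product, wood 2 product) for time shares x1, x2 in [0, 1]."""
    wood1 = 10 * x1 + 20 * x2
    wood2 = 10 * (1 - x1) + 40 * (1 - x2)
    return wood1, wood2

def grid_search(steps=20):
    """Maximize min{wood 1, wood 2} over a grid with spacing 1/steps (assumed)."""
    best = None
    for i in range(steps + 1):
        for j in range(steps + 1):
            x1, x2 = Fraction(i, steps), Fraction(j, steps)
            value = min(wood_products(x1, x2))
            if best is None or value > best[0]:
                best = (value, x1, x2)
    return best

best_value, best_x1, best_x2 = grid_search()
print(best_value, best_x1, best_x2)
```

On this grid the best allocation lets machine 1 work only on wood 1 while machine 2 splits its time evenly, which is consistent with machine 2 being relatively better at wood 2.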


Translation to an LP Notice in (1.45) that the objective function $\max\{\cdot, \cdot\}$
is convex (why? check in Prob 1.6), but it is not affine, so (1.45) is not
an LP. This is exactly where one important technique kicks in. One technique that
allows us to convert such a convex function into an affine function is to introduce
another variable, say $x_3$, such that:

$$x_3 \ge -10x_1 - 20x_2, \qquad x_3 \ge 10(x_1 - 1) + 40(x_2 - 1). \tag{1.46}$$

Now consider the following optimization:

$$\begin{aligned}
\min\; x_3 : \quad & 0 \le x_1 \le 1, \quad 0 \le x_2 \le 1, \\
& -10x_1 - 20x_2 - x_3 \le 0, \\
& 10x_1 + 40x_2 - x_3 - 50 \le 0.
\end{aligned} \tag{1.47}$$

Here the key observation is that the minimizer, say $x_3^*$, is achieved at:

$$x_3^* = \max\{-10x_1 - 20x_2,\; 10(x_1 - 1) + 40(x_2 - 1)\}. \tag{1.48}$$

Otherwise, i.e., if $x_3^* > \max\{-10x_1 - 20x_2,\; 10(x_1 - 1) + 40(x_2 - 1)\}$, one could
decrease $x_3^*$ slightly without violating (1.46), which contradicts the hypothesis that
$x_3^*$ is the minimizer. Hence, the translated problem (1.47) is equivalent to the original
problem (1.45).

Note that the objective function and all of the functions that appear in inequality 
constraints are affine. Hence, the problem is an LP. Using vector/matrix notations, 
we can also represent this in the following standard form: 


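$$\min\; w^T x : \quad Ax - b \le 0 \tag{1.49}$$

where:

$$w = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}, \qquad
A = \begin{bmatrix} -1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 1 & 0 \\ -10 & -20 & -1 \\ 10 & 40 & -1 \end{bmatrix}, \qquad
b = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 1 \\ 0 \\ 50 \end{bmatrix}. \tag{1.50}$$

The first rows of $A$ and $b$ come from $-x_1 \le 0$, and the second rows are due to
$x_1 - 1 \le 0$, and so on and so forth.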
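To see the standard form at work, here is a minimal Python sketch that solves this particular three-variable LP by brute-force vertex enumeration: every choice of three constraints is made tight, the resulting point is kept if it is feasible, and the best feasible point wins. This is not the simplex algorithm, and it assumes (as holds here) that the minimum is attained at a vertex. The helper names `det3`, `solve3` and `solve_lp_by_vertices` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(rows, rhs):
    """Solve a 3x3 linear system by Cramer's rule; return None if singular."""
    d = det3(rows)
    if d == 0:
        return None
    solution = []
    for k in range(3):
        replaced = [[rhs[r] if c == k else rows[r][c] for c in range(3)]
                    for r in range(3)]
        solution.append(Fraction(det3(replaced), d))
    return solution

def solve_lp_by_vertices(w, A, b):
    """Minimize w^T x subject to Ax <= b by checking all candidate vertices."""
    best = None
    for idx in combinations(range(len(A)), 3):
        x = solve3([A[i] for i in idx], [b[i] for i in idx])
        if x is None:
            continue
        feasible = all(sum(a * xi for a, xi in zip(row, x)) <= bi
                       for row, bi in zip(A, b))
        if feasible:
            value = sum(wi * xi for wi, xi in zip(w, x))
            if best is None or value < best[0]:
                best = (value, x)
    return best

# Data of (1.50); the variable is x = (x1, x2, x3).
w = [0, 0, 1]
A = [[-1, 0, 0], [1, 0, 0], [0, -1, 0], [0, 1, 0], [-10, -20, -1], [10, 40, -1]]
b = [0, 1, 0, 1, 0, 50]

value, x_opt = solve_lp_by_vertices(w, A, b)
print(value, x_opt)
```

Recall that the sign of the objective was flipped in (1.45), so the amount of each wood produced is the negative of the returned value. Vertex enumeration scales very poorly with the number of constraints, so it is meant only as a check on small examples.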


Monge’s transportation problem (Monge, 1781) The second problem that
we will study is another historical problem: Monge’s problem, explored by Gaspard
Monge, a French mathematician who lived in the 18th century. In 1781, he
published a memoir titled Mémoire sur la théorie des déblais et des remblais (Monge,
1781). In the memoir, he introduced a transportation problem which later laid
the foundation of transportation theory. In particular, the field of transportation
theory was revolutionized in the 20th century by the recognition that Monge’s
transportation problem can be translated into an LP. This recognition was made
by Kantorovich (Kantorovich, 1960). So here we will figure out how Kantorovich
recognized it as an LP.

Monge’s problem is about transporting soils (mined in several grounds) into
construction sites, each of which demands a certain amount of soils for construction
purpose. For instance, let us consider an example illustrated in Fig. 1.12.
Suppose there are three grounds (marked in black squares) and four construction sites
(marked in hollow circles). For each ground, a certain amount of soils can be
mined. Let $s_i$ be the amount of soils mined in ground $i \in \{1, 2, 3\}$. For simplicity,
we assume that $s_i$’s are normalized such that $s_1 + s_2 + s_3 = 1$. Let $d_j$ indicate the
amount of soils demanded at construction $j \in \{1, 2, 3, 4\}$. Assume that the total
demand is the same as the total supply. Then, $d_1 + d_2 + d_3 + d_4 = s_1 + s_2 + s_3 = 1$.

The goal of the problem is to find an optimal coupling such that the transportation
cost is minimized. To figure out how to achieve this, we first need to understand
how the transportation cost is determined. We assume that the cost is proportional
to two factors: (i) distance between a ground and a corresponding construction site
and (ii) the amount of the soils sent. To quantify the distance, we need to define
coordinates of locations of grounds and constructions. Let $x_i$ and $y_j$ denote location
coordinates of ground $i$ and construction $j$, respectively, where $i \in \{1, 2, 3\}$
and $j \in \{1, 2, 3, 4\}$. Then, the distance between ground $i$ and construction $j$ can
be written as:

$$\mathrm{dist}(\text{ground } i, \text{construction } j) = \|x_i - y_j\| \tag{1.51}$$

where $\|\cdot\|$ denotes the Euclidean norm.

[Figure: grounds $x_1, x_2, x_3$ with supplies $s_1, s_2, s_3$ and construction sites
$y_1, \dots, y_4$ with demands $d_1, \dots, d_4$; each arrow is labeled with the amount of soil
sent, e.g., $P_{X,Y}(x_2, y_2) = 0.5 s_2$, $P_{X,Y}(x_3, y_3) = 0.8 s_3$ and $P_{X,Y}(x_3, y_4) = 0.2 s_3$.]

Figure 1.12. A particular coupling in Monge’s transportation problem.

How to represent the amount of soils delivered from ground $i$ to construction
$j$? For ease of illustration, let us consider a particular coupling illustrated in
Fig. 1.12. This coupling suggests that the soils mined in ground 1 are transmitted
only to construction 1. So the amount of soils must read $s_1$. Let us denote that by
$P_{X,Y}(x_1, y_1) = s_1$. Later you will see why we use this complicated-looking
notation $P_{X,Y}(\cdot, \cdot)$. For ground 2, the soils are split into two construction sites:
constructions 2 and 3. Assume the equal split. We can then represent the splitting by
$P_{X,Y}(x_2, y_2) = 0.5 s_2$ and $P_{X,Y}(x_2, y_3) = 0.5 s_2$. Similarly for ground 3, the soils
are split into constructions 3 and 4, but with an asymmetric split, say 8:2. So we
have $P_{X,Y}(x_3, y_3) = 0.8 s_3$ and $P_{X,Y}(x_3, y_4) = 0.2 s_3$. Notice that the soil
allocation, determined by the values of $P_{X,Y}(x_i, y_j)$’s, is the quantity that we can control.
So this is an optimization variable. It is a 12-dimensional vector in this example.

Next think about constraints posed in the problem. The constraints are twofold:
(i) all the soils mined in each ground should be transmitted to construction
sites; and (ii) the demands of all the constructions should be satisfied. In terms of
mathematical notations, this means that:

$$\sum_{j=1}^{4} P_{X,Y}(x_i, y_j) = s_i, \quad i \in \{1, 2, 3\} \tag{1.52}$$

$$\sum_{i=1}^{3} P_{X,Y}(x_i, y_j) = d_j, \quad j \in \{1, 2, 3, 4\}. \tag{1.53}$$

We can then write down the optimization problem as follows. Given $(s_i, d_j)$’s
and $(x_i, y_j)$’s:

$$\begin{aligned}
\min\; & \sum_{i=1}^{3} \sum_{j=1}^{4} \underbrace{P_{X,Y}(x_i, y_j)}_{\text{amount of soils}} \; \underbrace{\|x_i - y_j\|}_{\text{distance}} : \\
& \sum_{j=1}^{4} P_{X,Y}(x_i, y_j) = s_i, \quad i \in \{1, 2, 3\}, \\
& \sum_{i=1}^{3} P_{X,Y}(x_i, y_j) = d_j, \quad j \in \{1, 2, 3, 4\}.
\end{aligned} \tag{1.54}$$

The objective function and all the functions that appear in the constraints are affine
w.r.t. the optimization variable $P_{X,Y}(x_i, y_j)$. Hence, it is an LP.
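The constraints and the objective of (1.54) can be checked numerically for the coupling of Fig. 1.12. In the Python sketch below, the supplies `s` and the coordinates `grounds` and `sites` are hypothetical numbers chosen for illustration, the demands `d` are defined as whatever this coupling delivers, and the non-negativity check reflects the natural assumption that a shipped amount cannot be negative. The code only evaluates one feasible coupling; it does not search for the optimal one.

```python
import math
from fractions import Fraction

# Hypothetical supplies (summing to 1) and planar coordinates.
s = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
grounds = [(0.0, 0.0), (0.0, 3.0), (4.0, 0.0)]
sites = [(3.0, 4.0), (0.0, 0.0), (4.0, 3.0), (1.0, 0.0)]

# Coupling of Fig. 1.12: rows are grounds, columns are construction sites.
P = [
    [s[0], 0, 0, 0],
    [0, s[1] / 2, s[1] / 2, 0],
    [0, 0, Fraction(8, 10) * s[2], Fraction(2, 10) * s[2]],
]

# Demands that this coupling serves exactly (its column sums).
d = [sum(P[i][j] for i in range(3)) for j in range(4)]

def is_feasible(P, s, d):
    """Check (1.52), (1.53) and non-negativity of the shipped amounts."""
    rows_ok = all(sum(P[i]) == s[i] for i in range(len(s)))
    cols_ok = all(sum(P[i][j] for i in range(len(s))) == d[j]
                  for j in range(len(d)))
    nonneg = all(p >= 0 for row in P for p in row)
    return rows_ok and cols_ok and nonneg

def transport_cost(P, grounds, sites):
    """Objective of (1.54): amount of soil times Euclidean distance, summed."""
    return sum(float(P[i][j]) * math.dist(grounds[i], sites[j])
               for i in range(len(grounds)) for j in range(len(sites)))

print(is_feasible(P, s, d), transport_cost(P, grounds, sites))
```

Because the cost is a weighted sum of the entries of `P` with fixed weights (the distances), the objective is linear in the coupling, which is exactly why (1.54) is an LP.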


Wasserstein distance If you think about the formula (1.54), you will see that one
can succinctly represent it by relying upon the concept of probability distribution.
The succinct form that you will see soon gives an insightful interpretation of the
transportation problem. Remember that $s_i$’s and $d_j$’s are normalized:
$s_1 + s_2 + s_3 = 1$ and $d_1 + d_2 + d_3 + d_4 = 1$. So one can view each of them as a valid
probability distribution. For example, defining $P_X(x_i) := s_i$, we see that $P_X(x_i)$ is
a probability distribution. Similarly, $P_Y(y_j) := d_j$ is another valid probability
distribution.

Keeping these in mind, let us take a careful look at the constraints in the
above optimization problem (1.54). What do the constraints remind you of? They
remind you of the total probability law! Hence, one can view $P_{X,Y}(x_i, y_j)$ as another
valid probability distribution. This probabilistic viewpoint then allows us to
simplify the optimization problem (1.54) as follows. Given $(P_X, P_Y)$,

$$W(P_X, P_Y) := \min_{P_{X,Y}} \mathbb{E}\left[\|X - Y\|\right] \tag{1.55}$$

where the minimization is over all joint distributions $P_{X,Y}$ which respect the
marginals:

$$\begin{aligned}
\sum_{y_j} P_{X,Y}(x_i, y_j) &= P_X(x_i) \quad \forall x_i, \\
\sum_{x_i} P_{X,Y}(x_i, y_j) &= P_Y(y_j) \quad \forall y_j.
\end{aligned} \tag{1.56}$$

Here $W(P_X, P_Y)$ is a function of $P_X$ and $P_Y$. So one nice interpretation is that
it can be viewed as a sort of distance between the two distributions. Notice that
$P_X = P_Y$ gives $W(P_X, P_Y) = 0$, while distinct marginal distributions yield larger
$W(\cdot, \cdot)$. This succinct expression (1.55) was recognized by Kantorovich and
Rubinstein. So it is called the Kantorovich-Rubinstein distance. The
distance measure was generalized later by incorporating an arbitrary $p$th-order exponent
in $\|X - Y\|$ (i.e., $\|X - Y\|^p$) (Vaserstein, 1969). The general measure
concerning $\|X - Y\|^p$ is called the $p$th-order Wasserstein distance. So $W(P_X, P_Y)$ is
called the 1st-order Wasserstein distance.
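To make (1.55) and (1.56) concrete, consider the smallest non-trivial case in which $X$ and $Y$ each take two scalar values. Fixing the marginals leaves a single free entry $t = P_{X,Y}(x_1, y_1)$, the cost is linear in $t$, and so the minimum over the feasible interval of $t$ is attained at an endpoint. The Python sketch below relies on this observation; the function name and the restriction to two-point scalar distributions are assumptions made for illustration, and the bound on $t$ comes from requiring every joint probability to be non-negative.

```python
def wasserstein_2x2(xs, ys, p, q):
    """1st-order Wasserstein distance for two-point distributions.

    P_X puts mass (p, 1 - p) on the scalars xs = (x1, x2);
    P_Y puts mass (q, 1 - q) on the scalars ys = (y1, y2).
    """
    def cost(t):
        # Joint distribution with P(x1, y1) = t and the marginals of (1.56).
        joint = [[t, p - t], [q - t, 1 - p - q + t]]
        return sum(joint[i][j] * abs(xs[i] - ys[j])
                   for i in range(2) for j in range(2))

    # Range of t that keeps all four joint probabilities non-negative.
    t_low, t_high = max(0.0, p + q - 1), min(p, q)
    # The cost is linear in t, so an endpoint is optimal.
    return min(cost(t_low), cost(t_high))

print(wasserstein_2x2((0, 1), (0, 1), 0.5, 0.5))
print(wasserstein_2x2((0, 1), (0, 1), 1.0, 0.0))
```

The first call uses identical marginals and the second moves all the mass from location 0 to location 1, matching the interpretation of $W$ as a distance between distributions.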

It turns out the Wasserstein distance appears in many of the optimal transportation
problems as a key measure, thus revolutionizing the field of transportation
theory. Very interestingly, the Wasserstein distance played a crucial role in designing
a famous machine learning model, called Generative Adversarial Networks
(GANs) (Goodfellow et al., 2014), thus leading to the development of Wasserstein
GAN (Arjovsky et al., 2017), which has proved powerful in many applications.
We will investigate details on this in Part III of this book.


Look ahead In the next section, we will study one more example for LP. We will
also do what we were planning to do for LP: studying an LP relaxation technique
which turns out to be instrumental in addressing some very difficult problems.


1.5 LP: Examples and Relaxation


Recap In the previous section, we explored two historical examples that can be
translated into LPs and that have played a big role in fields like economics,
operations research and transportation:


1. Kantorovich’s plywood cutting problem; 
2. Monge’s transportation problem. 


In the process, we learned about a couple of techniques which allow us to translate 
problems into LPs. We also gained some insights as to how to recognize LPs. 


Outline In this section, we will study another interesting and prominent 
example: 


3. Classification problem. 


This is arguably the most classical problem that arises in machine learning. Next 
we will deal with another claim that we made in Section 1.3. Some very difficult 
problems can be solved via LP relaxation. Specifically we will study a canonical 
example in which the LP relaxation can provide the exact solution in some cases. 


Classification problem The last example that we will touch upon is a very popular
problem that arises in a widening array of fields such as machine learning,
artificial intelligence, finance, and real estate. In particular, it is the most classical and
canonical problem in machine learning.

For illustrative purpose, let us explain what the problem is under a simple task
setting: classifying legitimate emails against spam emails. Suppose there are two
datasets. One dataset contains data points (also called samples or examples in the
machine learning field) concerning spam emails. The other includes those concerning
legitimate emails. Assume that each data point is represented by two features: (i)
frequency of keyword 1 (say, dollar signs $$); and (ii) frequency of keyword 2 (say,
winner). In machine learning, a feature is a frequently used term which
refers to a key component that describes characteristics of data well. Denote each
data point by $x^{(i)} := (x_1^{(i)}, x_2^{(i)})$ where $x_1^{(i)}$ and $x_2^{(i)}$ indicate the two
features: frequencies of keyword 1 and 2 contained in the $i$th email, respectively. See Fig. 1.13
for data points in two datasets, blue (legitimate) and red (spam) datasets. Here we
are also given a label $y^{(i)}$ which indicates whether data point $x^{(i)}$ comes from a
legitimate email ($y^{(i)} = 1$) or from a spam email ($y^{(i)} = -1$). Assume that we have $m$
such paired samples.

[Figure: scatter plot with horizontal axis "frequency of keyword 1" and vertical axis
"frequency of keyword 2"; the line $a^T x - b = 0$ separates the two datasets.]

Figure 1.13. Visualization of two datasets for email classification: one for spam emails;
the other for legitimate emails.


Given these data points together with labels $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$, the goal of the
classification problem is to find a line that separates two datasets. We consider
the simplest classification approach, called linear classification, which seeks
a linear function. A line can be represented as a linear equation in the
two-dimensional space: $a^T x - b = 0$. Note that $a^T x - b \ge 0$ is a half-space that covers
the region above the line (where the blue data points reside) while $a^T x - b \le 0$
indicates another half-space spanning the bottom region (where the red data points
reside). Hence, in order for a line to separate the two datasets, it must hold
that:

$$y^{(i)} (a^T x^{(i)} - b) \ge 0, \quad i \in \{1, \dots, m\}. \tag{1.57}$$


Notice that the optimization variables are $(a, b)$ instead of $(x^{(i)}, y^{(i)})$’s. You may
be confused about the notations because we have so far used the $x$ notation for the
optimization variable. Here we use the $x$ notation to indicate data points, which is
a sort of convention in machine learning.

Notice in (1.57) that $(a, b) = (0, 0)$ always satisfies the constraint. However,
it is obviously not of interest, since the trivial solution $(a, b) = (0, 0)$ makes the
classifier play no role. Hence, one may want strict separability, meaning that a strict
inequality may be preferred in (1.57):

$$y^{(i)} (a^T x^{(i)} - b) > 0, \quad i \in \{1, \dots, m\}. \tag{1.58}$$


However, this is also problematic. Remember the standard form of optimization
problems. The inequality constraints should be of the “$\le$” type. In fact, the rationale behind
the use of this form is related to the strong duality that we will study in depth in
Part II. For now, let us just adopt this form. Given this form, the strict inequality
in (1.58) is indeed problematic.

In order to address this, one can introduce a tiny value, say $\epsilon > 0$, so that we obtain
an inequality constraint like:

$$y^{(i)} (a^T x^{(i)} - b) \ge \epsilon. \tag{1.59}$$

Here the scale of the values that appear in the inequality does not affect finding the
optimal solution $(a^*, b^*)$ that we will seek shortly. Hence, without loss of generality,
we consider the normalized version instead:

$$y^{(i)} (a^T x^{(i)} - b) \ge 1, \quad i \in \{1, \dots, m\}. \tag{1.60}$$

Again, whenever we have a strict inequality as in (1.58), one can obtain an inequality
like (1.60) by properly scaling $(a, b)$.

As long as the above constraints are satisfied for all data points $(x^{(i)}, y^{(i)})$’s, we are
done, meaning that there is nothing to minimize or maximize. One traditional way
to represent an optimization problem in such a case is to set an arbitrary yet constant
value as an objective function. So one such problem reads: Given $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$,

$$\min_{a, b}\; 0 : \quad y^{(i)} (a^T x^{(i)} - b) \ge 1, \quad i \in \{1, \dots, m\} \tag{1.61}$$

where the objective function is simply set to 0. Notice that we have nothing to
minimize. Observe that all the functions that appear are affine. Hence, it is an LP.
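The scaling step from (1.58) to (1.60) can be illustrated with a short Python sketch. The toy data, the candidate line and the function names are hypothetical; the code does not solve (1.61), it only shows that a strictly separating $(a, b)$ can be rescaled so that every constraint in (1.60) holds.

```python
from fractions import Fraction

def margins(a, b, data):
    """Return y * (a^T x - b) for every labeled point (x, y)."""
    return [y * (sum(ak * xk for ak, xk in zip(a, x)) - b) for x, y in data]

def rescale_to_unit_margin(a, b, data):
    """Scale a strictly separating (a, b) so the smallest value in (1.58) becomes 1."""
    smallest = min(margins(a, b, data))
    if smallest <= 0:
        raise ValueError("(a, b) does not strictly separate the data")
    return [Fraction(ak) / smallest for ak in a], Fraction(b) / smallest

# Hypothetical toy data: (keyword frequencies, label); +1 legitimate, -1 spam.
data = [((3, 4), 1), ((4, 5), 1), ((1, 1), -1), ((2, 0), -1)]
a, b = [1, 1], 4  # candidate line x1 + x2 = 4 (assumed, not optimized)

a_scaled, b_scaled = rescale_to_unit_margin(a, b, data)
print(a_scaled, b_scaled, margins(a_scaled, b_scaled, data))
```

Dividing $(a, b)$ by a positive constant leaves the line $a^T x - b = 0$ unchanged, which is why the normalization in (1.60) loses no generality.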

On a side note: You may wonder how we can apply an algorithm like gradient 
descent, since we have a constant objective with 0 derivative all the time. Notice that 
this is a constrained optimization. Remember the claim that we stated in Section 1.3. 
A convex constrained optimization can be translated into an unconstrained version 
thanks to the consequence of strong duality. It turns out the objective function in 
the unconstrained counterpart is indeed a function of the optimization variable. 
This point will be clearer in Part II. 


Non-separable case In the classification setting, one may ask the following natural
question. What if datasets are not linearly separable, as illustrated in Fig. 1.14?
Notice in Fig. 1.14 that some red points reside near a cluster of blue points,
and also some blue points are mingled with a majority of red points. Obviously
there is no line that separates the two datasets. This non-separable case often
occurs in various tasks in machine learning. For instance, in the cat-dog
classification problem, the boundary that separates cat dataset and dog dataset is highly
non-linear.

[Figure: scatter plot with horizontal axis "frequency of keyword 1" and vertical axis
"frequency of keyword 2"; points of the two datasets lie on both sides of the line
$a^T x - b = 0$.]

Figure 1.14. Non-separable case of email classification.


One naive³ yet natural way to handle this non-separable case is to introduce the
concept of margin. For some outlier data points $(x^{(i)}, y^{(i)})$, we introduce margins,
say $v^{(i)} \ge 0$, such that

$$y^{(i)} (a^T x^{(i)} - b) + v^{(i)} \ge 1, \quad i \in \{1, \dots, m\}. \tag{1.62}$$

Whenever $y^{(i)} (a^T x^{(i)} - b)$ is strictly less than 1 (which is undesirable), we introduce
a positive margin $v^{(i)}$ so that the sum of them is greater than or equal to 1. Obviously
the smaller the margin, the better the situation.

We can then set out our new goal as: minimizing the aggregated margins while 
respecting the above constraint (1.62). Hence, one can formulate the optimization 
problem as: 


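$$\begin{aligned}
\min_{a, b, v}\; & \sum_{i=1}^{m} v^{(i)} : \\
& y^{(i)} (a^T x^{(i)} - b) + v^{(i)} \ge 1, \quad i \in \{1, \dots, m\}, \\
& v^{(i)} \ge 0, \quad i \in \{1, \dots, m\}.
\end{aligned} \tag{1.63}$$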


Again this is still an LP. 
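For a fixed line $(a, b)$, the constraints of (1.63) force each margin variable to satisfy both $v^{(i)} \ge 1 - y^{(i)}(a^T x^{(i)} - b)$ and $v^{(i)} \ge 0$, so the smallest admissible value is the larger of the two lower bounds. The Python sketch below computes these values; the toy data, the fixed line and the function name are hypothetical, and the code evaluates the objective of (1.63) for one candidate $(a, b)$ rather than optimizing over $(a, b)$.

```python
def minimal_margin_variables(a, b, data):
    """Smallest v_i satisfying the constraints of (1.63) for a fixed (a, b)."""
    values = []
    for x, y in data:
        score = y * (sum(ak * xk for ak, xk in zip(a, x)) - b)
        values.append(max(0, 1 - score))
    return values

# Hypothetical non-separable toy data: the last two points are on the wrong side.
data = [((3, 4), 1), ((4, 5), 1), ((1, 1), -1), ((2, 0), -1),
        ((1, 2), 1), ((3, 3), -1)]
a, b = [1, 1], 4  # candidate line x1 + x2 = 4 (assumed, not optimized)

v = minimal_margin_variables(a, b, data)
total = sum(v)
print(v, total)
```

Points that already satisfy (1.60) need no margin, while each misclassified point contributes in proportion to how far it sits on the wrong side.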


3. A more powerful yet sophisticated way is to employ deep neural networks. During the past decades, there has
been a breakthrough in machine learning. It has been shown that deep neural networks can well represent
any arbitrary (possibly highly non-linear) functions with reasonable computational complexity in view of
current technologies. So one can use such a network to implement a non-linear classifier to do much better.
We will get to this point in depth in Part III. Neural networks are systems with one or multiple layers in
which each layer consists of an affine operation and an arbitrary (possibly non-linear) operation (called the
activation function in the literature). Input and output in each layer are high-dimensional vectors and each
component in the vectors is represented as a circle and called a neuron. The naming originated from
the fact that the structure looks like that of brain networks. Deep neural networks refer to those having at
least one hidden layer between input and output layers.


A class of difficult yet solvable problems Recall the claim that we made
earlier. Some very difficult problems can be solved via LP relaxation. So let us study
the technique in the context of such a difficult problem class. One such prominent
class is a class of boolean problems, which can be formally stated as below.

$$\begin{aligned}
p^* = \min\; & w^T x : \\
& Ax - b \le 0, \quad Cx - e = 0, \\
& x_i \in \{0, 1\}, \quad i \in \{1, \dots, d\}.
\end{aligned} \tag{1.64}$$

We see the last additional constraint, saying that the optimization variable is
constrained to be boolean. We often use the following shorthand notation: $x \in \{0, 1\}^d$.
There are many problems that can be formulated as above in the real world. To get
some feeling about the problem nature, let us explore one popular example.


Example: Shortest path problem (Dijkstra et al., 1959) The popular prob- 
lem that we will delve into is a fundamental problem in combinatorial optimization, 
named the shortest path problem. The problem is about finding a path from a source 
to a destination in a graph. 

For ease of illustration, let us consider an example in Fig. 1.15. To understand
what this problem is, we need to first know about the concept of graphs. A graph,
denoted by $G$, is a collection of two sets: (i) a vertex (or node) set, denoted by $V$; and
(ii) an edge set, denoted by $\mathcal{E}$. The vertex set includes many nodes indicating some
objects of interest. The edge set includes many edges indicating some connections
between two nodes. Here we have a graph $G = (V, \mathcal{E})$ where:

$$\begin{aligned}
V = \{&1, 2, 3, 4, 5, 6\}; \\
\mathcal{E} = \{&(1, 2), (2, 1), (1, 5), (5, 1), (2, 3), (3, 2), (2, 5), (5, 2), \\
&(3, 4), (4, 3), (4, 5), (5, 4), (4, 6), (6, 4)\}.
\end{aligned}$$

We consider a bi-directed graph in which the edges are bi-directional. Let node 1
and node 6 be source and destination, respectively. A path is defined as a sequence of
edges that connects the source to the destination. See an example path in Fig. 1.15,
marked in a green line: $(1, 5) \to (5, 4) \to (4, 6)$. Let $x_{ij}$ indicate whether the edge
$(i, j)$ participates in the path. So in this example, $x_{15} = x_{54} = x_{46} = 1$ while all
others are 0, i.e., $x_{51} = 0$, $x_{12} = 0$, ..., $x_{34} = 0$, $x_{64} = 0$. Notice that the path has
a direction. So $x_{51} = x_{45} = x_{64} = 0$, since those are reverse to the direction of the
path flow.

[Figure: graph with nodes 1 to 6; the example path from the source (node 1) to the
destination (node 6) is highlighted, with $x_{46} = 1$ on edge $(4, 6)$.]

Figure 1.15. Illustration of the shortest path problem.

The goal of the problem is to find a path such that the total cost is minimized.
It is assumed that the cost is decided by the weight $w_{ij}$ (non-negative) that comes
with an edge $(i, j)$. Hence, the objective can be stated as:

$$\min_{x_{ij}, \, \forall (i,j) \in \mathcal{E}} \; \sum_{(i,j) \in \mathcal{E}} w_{ij} x_{ij}. \tag{1.65}$$


Here the constraint is that $x_{ij}$’s should be set to ensure a valid path, i.e.,
connecting the source to the destination. Then, how to check whether or not a path is
valid? The validation is a bit tricky. But if you think about the nature of the flow
of a valid path, you can come up with an easy way to check.

Consider the flow at source node 1 in the example. One key observation is that
the flow is just outgoing, i.e., there is no ingoing flow. So we should read:

$$\text{outgoing flow} - \text{ingoing flow} = 1
\;\Longleftrightarrow\; \sum_{(1,j) \in \mathcal{E}} x_{1j} - \sum_{(i,1) \in \mathcal{E}} x_{i1} = 1 \tag{1.66}$$

where $\sum_{(1,j) \in \mathcal{E}} x_{1j}$ indicates the entire flow that comes out of source 1 and
$\sum_{(i,1) \in \mathcal{E}} x_{i1}$ denotes the aggregated flow that goes into source 1. On the other
hand, at the destination, the situation is reversed:

$$\text{outgoing flow} - \text{ingoing flow} = -1
\;\Longleftrightarrow\; \sum_{(6,j) \in \mathcal{E}} x_{6j} - \sum_{(i,6) \in \mathcal{E}} x_{i6} = -1. \tag{1.67}$$

For any other node, say $u$ (neither source nor destination), the flow is just passing,
meaning that the flow coming in must go out. So we have:

$$\text{outgoing flow} - \text{ingoing flow} = 0
\;\Longleftrightarrow\; \sum_{(u,j) \in \mathcal{E}} x_{uj} - \sum_{(i,u) \in \mathcal{E}} x_{iu} = 0. \tag{1.68}$$

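Using the above, we can formulate the optimization problem as:

$$\begin{aligned}
\min_{x_{ij}, \, \forall (i,j) \in \mathcal{E}} \; & \sum_{(i,j) \in \mathcal{E}} w_{ij} x_{ij} : \\
& \sum_{(1,j) \in \mathcal{E}} x_{1j} - \sum_{(i,1) \in \mathcal{E}} x_{i1} = 1, \\
& \sum_{(6,j) \in \mathcal{E}} x_{6j} - \sum_{(i,6) \in \mathcal{E}} x_{i6} = -1, \\
& \sum_{(u,j) \in \mathcal{E}} x_{uj} - \sum_{(i,u) \in \mathcal{E}} x_{iu} = 0 \quad \forall u \in \{2, 3, 4, 5\}, \\
& x_{ij} \in \{0, 1\} \quad \forall (i,j) \in \mathcal{E}.
\end{aligned} \tag{1.69}$$

We see that this is a boolean problem (1.64).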
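The flow constraints in (1.69) are easy to check mechanically. The Python sketch below encodes the edge set of the example graph and tests whether a given 0/1 assignment satisfies (1.66) to (1.68); the function and variable names are hypothetical, and `x_broken` is a made-up assignment that deliberately skips the edge $(5, 4)$.

```python
EDGES = [(1, 2), (2, 1), (1, 5), (5, 1), (2, 3), (3, 2), (2, 5), (5, 2),
         (3, 4), (4, 3), (4, 5), (5, 4), (4, 6), (6, 4)]
NODES = [1, 2, 3, 4, 5, 6]
SOURCE, DEST = 1, 6

def net_outflow(x, u):
    """Outgoing flow minus ingoing flow at node u."""
    out_flow = sum(x[(i, j)] for (i, j) in EDGES if i == u)
    in_flow = sum(x[(i, j)] for (i, j) in EDGES if j == u)
    return out_flow - in_flow

def satisfies_flow_constraints(x):
    """Check the constraints of (1.69) for an edge-indicator dictionary x."""
    for u in NODES:
        required = 1 if u == SOURCE else (-1 if u == DEST else 0)
        if net_outflow(x, u) != required:
            return False
    return all(x[e] in (0, 1) for e in EDGES)

def indicator(path_edges):
    """Build x with x[e] = 1 exactly for the edges in path_edges."""
    return {e: (1 if e in path_edges else 0) for e in EDGES}

x_path = indicator({(1, 5), (5, 4), (4, 6)})   # the path of Fig. 1.15
x_broken = indicator({(1, 5), (4, 6)})         # flow is lost at node 5

print(satisfies_flow_constraints(x_path), satisfies_flow_constraints(x_broken))
```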


LP relaxation The boolean problem (1.64) is known to be notoriously difficult
in general. In most cases, we need to search over all possible binary choices of $x$ to
figure out the optimal solution. So the complexity scales like $2^d$, growing
exponentially with the dimension $d$.
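The exhaustive search just described can be written down directly for the shortest path example, where $d = 14$ and hence $2^{14} = 16384$ binary vectors must be examined. In the Python sketch below, the unit weights are a hypothetical choice (the text does not specify $w_{ij}$), and the function names are illustrative.

```python
from itertools import product

EDGES = [(1, 2), (2, 1), (1, 5), (5, 1), (2, 3), (3, 2), (2, 5), (5, 2),
         (3, 4), (4, 3), (4, 5), (5, 4), (4, 6), (6, 4)]
WEIGHTS = {e: 1 for e in EDGES}          # hypothetical unit weights
REQUIRED = {1: 1, 2: 0, 3: 0, 4: 0, 5: 0, 6: -1}  # net outflow per node

def is_valid(x):
    """Flow constraints of (1.69): net outflow must match REQUIRED at every node."""
    for u, required in REQUIRED.items():
        out_flow = sum(x[(i, j)] for (i, j) in EDGES if i == u)
        in_flow = sum(x[(i, j)] for (i, j) in EDGES if j == u)
        if out_flow - in_flow != required:
            return False
    return True

def brute_force_shortest_path():
    """Try all 2^d binary vectors and keep the cheapest valid one."""
    best_cost, best_x = None, None
    for bits in product((0, 1), repeat=len(EDGES)):
        x = dict(zip(EDGES, bits))
        if not is_valid(x):
            continue
        cost = sum(WEIGHTS[e] * x[e] for e in EDGES)
        if best_cost is None or cost < best_cost:
            best_cost, best_x = cost, x
    return best_cost, best_x

best_cost, best_x = brute_force_shortest_path()
chosen = {e for e in EDGES if best_x[e] == 1}
print(best_cost, sorted(chosen))
```

Adding a single edge doubles the number of candidates, which is the exponential growth that motivates looking for something better than enumeration.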

To deal with such difficult problems, people thought about a way to move forward.
One natural way is to just ignore the binary value constraint (the very cause
of making the problem intractable). This natural way is indeed relaxation. Simply
ignoring the binary value constraint in (1.64), we obtain:

$$\begin{aligned}
p^*_{\mathsf{LP}} = \min\; & w^T x : \\
& Ax - b \le 0, \quad Cx - e = 0, \\
& 0 \le x_i \le 1, \quad i \in \{1, \dots, d\}
\end{aligned} \tag{1.70}$$

where $x_i$ is relaxed to be any real value in $[0, 1]$.
Since every $x$ that is feasible for (1.64) is also feasible for (1.70), the relaxed
problem can only do better, so in general,

$$p^*_{\mathsf{LP}} \le p^*. \tag{1.71}$$

Interestingly, it turns out that under some situations of shortest path problems, the
optimal solution for the relaxed problem is binary: $x^*_{\mathsf{LP}} \in \{0, 1\}^d$, thus implying
that $p^* = p^*_{\mathsf{LP}}$. We will not prove this here. Instead you may want to check this
numerically via programming tools. If you are interested in further details, you
may want to take a graph theory course offered in math and/or computer science
departments.


Look ahead So far we have studied three examples that can be
formulated as LPs, as well as some boolean problems which can be solved via LP
relaxation. In the next section, we will cover the one remaining part of what we
were planning to do for LP. That is, investigating efficient algorithms.


1.6 LP: Algorithms


Recap During the past two sections, we have studied three important problems
that can be formulated as LPs: (1) Kantorovich’s plywood cutting problem
(Gardner, 1990); (2) Monge’s transportation problem (Monge, 1781); and (3) the linear
classification problem. We also investigated a class of difficult problems which can
nevertheless be solved via LP relaxation: boolean problems. As an example, we explored
the shortest path problem (Dijkstra et al., 1959).


Outline In this section, we are going to study an efficient algorithm for LP. We
will focus on one particular algorithm, called the simplex algorithm. Specifically
what we are going to do is threefold. There are a couple of historical and
famous algorithms for LP. First off, we will provide the rationale behind why the
simplex algorithm is chosen among them. We will then investigate
the standard form that the algorithm relies upon. Lastly we will study how the
algorithm works in detail. The algorithm is intuitive and beautiful. You will see
this soon.


Algorithms for LP Three major algorithms have been recognized for LP. The
first is the one that the father of LP, Kantorovich, developed. The second
is a very famous and faster algorithm, called the simplex algorithm (Dantzig,
1951). The algorithm was developed in 1947 by an American mathematician,
George Dantzig. Actually some scientists and mathematicians in the West,
especially at Berkeley (where Dantzig obtained his PhD) and Stanford (where he was
a Professor), claimed that the inventor of LP is Dantzig. The claim was based
on the fact that the simplex algorithm is the first one that solves any LP in a
finite number of steps (which was not revealed by Kantorovich) as well as the
fact that the name LP was first used in print by Dantzig. However, many
people did not accept the claim, and perhaps more importantly, the Nobel Prize
committee was silent on this. Kantorovich’s contribution for the Nobel Prize was
actually on the optimal allocation of scarce resources, which is not the invention of
LP although highly related (Kantorovich, 1989). If the committee had wanted
to award a prize for LP, then Dantzig should have been included. The last
algorithm is a very generic algorithm, called the interior point method (Dikin, 1967;
Wright, 2005), which can be applied to general convex optimization problems
not limited to LP. The algorithm is based on strong duality. Since strong duality
will be covered later in Part II, we will study the algorithm at that time.
This is the reason that the simplex algorithm is chosen as the focus of this
section.


Standard form for the simplex algorithm Remember the standard form
of LP:

$$\min\; w^T x : \quad Ax - b \le 0, \quad Cx - e = 0. \tag{1.72}$$

The simplex algorithm relies upon a different-looking yet equivalent form, called
the standard form for the simplex algorithm:

$$\max\; w^T x : \quad Ax \le b, \quad x \ge 0. \tag{1.73}$$

Later you will see why this form (1.73) helps run the algorithm. One can readily
see that (1.73) belongs to the class of (1.72). It turns out the other direction is also
true, i.e., any form like (1.72) can always be converted into the form like (1.73).


How to convert (1.72) into the standard form (1.73)? To show the conversion
from (1.72) to (1.73), we need to demonstrate four things. The first is to
convert min to max. This can be done very easily by flipping the sign of the objective
function. The second is to convert the equality constraint into inequality one(s).
This is also immediate because:

$$Cx - e = 0 \;\Longleftrightarrow\; Cx \le e, \; Cx \ge e.$$

The third is to express the “$\ge$” inequality in the “$\le$” form that appears in (1.73).
This is done by flipping the sign of both sides: $Cx \ge e \Longleftrightarrow -Cx \le -e$.
The last is to ensure that all the optimization variables are non-negative, i.e.,
$x \ge 0$. This can be done in the following manipulation. Suppose there is no sign
constraint on a variable, say $x_1$. Then, by introducing two new non-negative
variables, say $x_2, x_3 \ge 0$, we can cover the case by setting:

$$x_1 = x_2 - x_3, \quad x_2, x_3 \ge 0.$$

Here one important note is that using the equality $x_1 = x_2 - x_3$, we should replace
all $x_1$’s (that appear in other constraints if any) with $x_2 - x_3$, so that there is no $x_1$
in the final form.
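The conversion steps can be collected into one small routine. The Python sketch below is a minimal illustration under the assumption that every variable in (1.72) is free, so each one is split into a difference of two non-negative variables; the function name, the list-of-lists matrix representation and the example numbers are hypothetical.

```python
def to_simplex_standard_form(w, A, b, C, e):
    """Convert min w^T x : Ax <= b, Cx = e (x free) into max obj^T z : rows z <= rhs, z >= 0.

    The new variable is z = (x_plus, x_minus) with x = x_plus - x_minus.
    """
    # Equality Cx = e becomes Cx <= e and -Cx <= -e.
    rows = ([list(r) for r in A]
            + [list(r) for r in C]
            + [[-c for c in r] for r in C])
    rhs = list(b) + list(e) + [-v for v in e]
    # Free variables: each column c of x_plus gets a mirrored column -c for x_minus.
    rows = [r + [-c for c in r] for r in rows]
    # min w^T x equals -max (-w)^T x, so the objective sign is flipped.
    obj = [-c for c in w] + list(w)
    return obj, rows, rhs

# Hypothetical example: min x1 + 2 x2 : x1 + x2 <= 4, x1 - x2 = 0.
obj, rows, rhs = to_simplex_standard_form([1, 2], [[1, 1]], [4], [[1, -1]], [0])
print(obj, rows, rhs)
```

Splitting every variable doubles the dimension; when some variables are already known to be non-negative, only the free ones need to be split.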


Simplex algorithm: Conversion into the slack form In fact, the precise
description of the algorithm is complicated although the idea is simple and insightful.
So we will focus only on grasping the key idea through the following example:

$$\begin{aligned}
\max\; 5x_1 + 4x_2 : \quad & 3x_1 + 5x_2 \le 78, \\
& 4x_1 + x_2 \le 36, \\
& x_1, x_2 \ge 0.
\end{aligned} \tag{1.74}$$

The algorithm starts with conversion into another form, called the slack form. In
the slack form, two types of new variables are introduced. One is the target variable,
usually denoted by $z$, which indicates the objective function itself:

$$z = 5x_1 + 4x_2.$$

The others are the slack variables, usually denoted by $s_i$’s. These are used for making
the inequality constraints equality ones. For instance, $3x_1 + 5x_2 \le 78$ can be
equivalently written as:

$$3x_1 + 5x_2 + s_1 = 78, \quad s_1 \ge 0.$$

You may wonder why we convert back into equality constraints, since we had
already converted the equality constraints in (1.72) into inequality constraints as
in (1.73). The reason is relevant to a certain condition (to be described in the sequel)
that the translated equality constraints should respect.

To see this clearly, we re-write (1.74) as follows with the target and slack variables:

$$\begin{align}
\max\; z : \quad & \tag{1.75} \\
& z - 5x_1 - 4x_2 = 0 \tag{1.76} \\
& 3x_1 + 5x_2 + s_1 = 78 \tag{1.77} \\
& 4x_1 + x_2 + s_2 = 36 \tag{1.78} \\
& x_1, x_2, s_1, s_2 \ge 0. \tag{1.79}
\end{align}$$

In the slack form, we see that the right hand side (RHS) terms in the translated
equality constraints (0, 78 and 36 in the above) are non-negative. Here the RHS
being non-negative is the certain condition that the translated equality constraints
should satisfy. Actually we could obtain such non-negative RHS terms by going
through the two-step conversion: (i) converting (1.72) into (1.73); and then (ii)
converting inequality constraints back into equality ones with slack variables.

However, if you think about it, you may see that the RHS in the translated equal- 
ity constraint is not always guaranteed to be non-negative. For example, suppose we 
have the following inequality instead: 


$$-2x_1 - 4x_2 \le -34.$$


In this case, if we naively apply the slack-variable trick that we did earlier, then we 
would get —34 in the RHS, violating the certain condition. Hence, to avoid this, 
we should take a slightly different slack-variable trick. We first flip the sign of both 
sides to obtain: 


$$2x_1 + 4x_2 \ge 34.$$

We then subtract a non-negative slack variable, say $s_1$, so as to obtain: 

$$2x_1 + 4x_2 - s_1 = 34.$$

This way, we can ensure that all the RHSs are non-negative while $s_1$ remains 
non-negative. 
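The sign-flipping rule above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the helper name `to_slack_row` and the tuple it returns are hypothetical choices made here, not notation from the text.

```python
def to_slack_row(coeffs, rhs):
    """Turn  sum_j coeffs[j] * x_j <= rhs  into an equality with a non-negative RHS.

    Returns (new_coeffs, slack_sign, new_rhs), meaning
        sum_j new_coeffs[j] * x_j + slack_sign * s = new_rhs,  with s >= 0.
    (Hypothetical helper, written for illustration only.)
    """
    if rhs >= 0:
        # Usual case: add a non-negative slack variable.
        return list(coeffs), +1, rhs
    # Negative RHS: flip the sign of both sides (<= becomes >=),
    # then subtract a non-negative slack variable instead.
    return [-c for c in coeffs], -1, -rhs


print(to_slack_row([3, 5], 78))     # encodes 3*x1 + 5*x2 + s = 78
print(to_slack_row([-2, -4], -34))  # encodes 2*x1 + 4*x2 - s = 34
```

In both cases the returned right-hand side is non-negative, which is the condition the slack form is meant to respect.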


Simplex algorithm: The overall procedure The simplex algorithm is an iter- 
ative algorithm. For each iteration, we do the following: 


1. Start with an initial feasible solution; 
2. Perturb the solution along a direction that maximizes the target variable z. 


Once we obtain a newly perturbed solution, we set it as another initial point in the 
next iteration, and again do perturbation along a new z-maximizing direction in 
view of the new initial point. We repeat this procedure until any perturbation does 
not increase z further. 


Iteration 1 How to set up an initial solution? There could be many ways. One natural 
way comes from a particular form of the two equality constraints that involve 
the two slack variables: (1.77) and (1.78). Notice that $s_1$ appears only in (1.77), 
and similarly $s_2$ appears only in (1.78); on the other hand, $(x_1, x_2)$ appear in both 
equations. So one easily derived feasible point is the one in which we set the 
variables appearing in both equations to zero, i.e., $x_1 = x_2 = 0$. This way, one can readily see 
that the following is a feasible solution: 

$$(x_1, x_2, s_1, s_2) = (0, 0, 78, 36). \tag{1.80}$$


Notice in this case that $z = 0$ due to (1.76). 
A natural question arises. Can we do better? To figure this out, we consider the 
equality constraint (1.76) that includes $z$ of interest: 

$$z = 5x_1 + 4x_2. \tag{1.81}$$

We see that increasing $x_1$ and/or $x_2$ (from the initial point $x_1 = x_2 = 0$) yields an 
increase in $z$. So one may wonder which direction maximizes $z$. There are 
three possible options that one can think of: (i) increasing $x_1$ only while maintaining 
$x_2 = 0$, i.e., $(x_1, x_2) = (\delta, 0)$ where $\delta > 0$; (ii) the other way around, i.e., $(x_1, x_2) = 
(0, \delta)$; or (iii) increasing both $x_1$ and $x_2$, i.e., $(x_1, x_2) = (\delta_1, \delta_2)$ where $\delta_i > 0$. The 
simplex algorithm takes only the first two options. You may wonder why the last 
option is ignored; the reason will be explored in depth in Prob 2.6. 

The first option seems to be the $z$-maximizing direction because the slope 5 placed 
in front of $x_1$ is larger than the slope 4 in front of $x_2$ in (1.81). However, it is not 
that clear if taking that direction is indeed the best way to go. The reason is that 
the maximum values of $\delta$ that we can push through can be different across distinct 
directions. Notice that we still need to respect the feasible region induced by the 
constraints while searching for the maximum values of $\delta$. So we need to investigate 
the two options carefully. 

First consider $(x_1, x_2) = (\delta, 0)$. The constraints (1.77) and (1.78) then give: 
$s_1 = 78 - 3\delta \ge 0$ and $s_2 = 36 - 4\delta \ge 0$, which in turn yields: $\delta \le \min\{26, 9\} = 9$. 
So $x_1$ can be maximally set to 9, where $z = 45$. On the other hand, $(x_1, x_2) = (0, \delta)$ 
gives: $s_1 = 78 - 5\delta \ge 0$ and $s_2 = 36 - \delta \ge 0$. Hence, $\delta \le \min\{\frac{78}{5}, 36\} = 
\frac{78}{5}$, which yields $z = 62.4$. So from this, we see that the second option is better, 
although the slope 4 is smaller than the other slope 5. This naturally motivates us 
to choose the following feasible point: 


$$(x_1, x_2, s_1, s_2) = \left(0, \tfrac{78}{5}, 0, \tfrac{102}{5}\right) \tag{1.82}$$

where $s_1$ and $s_2$ are set according to the constraints (1.77) and (1.78). The geometric 
picture for this is illustrated in Fig. 1.16. We move from $(x_1, x_2) = (0,0)$ 
to $(x_1, x_2) = (0, \frac{78}{5})$. 
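The comparison of the two directions in Iteration 1 can be reproduced with exact fractions. The sketch below is ours, not the book's: the helper `max_step` and the dictionaries are hypothetical names, and the data are simply the coefficients of (1.77), (1.78) and (1.81).

```python
from fractions import Fraction


def max_step(column, rhs):
    """Largest delta >= 0 keeping rhs[k] - column[k] * delta >= 0 in every row k."""
    ratios = [Fraction(b, a) for a, b in zip(column, rhs) if a > 0]
    return min(ratios) if ratios else None  # None would mean an unbounded direction


rhs = [78, 36]                            # right-hand sides of (1.77) and (1.78)
columns = {"x1": [3, 4], "x2": [5, 1]}    # coefficients of x1 and x2 in those rows
slopes = {"x1": 5, "x2": 4}               # coefficients in z = 5*x1 + 4*x2

for name in ("x1", "x2"):
    delta = max_step(columns[name], rhs)
    print(name, "max delta =", delta, "resulting z =", slopes[name] * delta)
# x1 max delta = 9 resulting z = 45
# x2 max delta = 78/5 resulting z = 312/5
```

The larger slope (5) does not win here because its direction hits a constraint sooner; $312/5 = 62.4$ exceeds 45.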


Iteration 2 We now take the solution (1.82) as an initial feasible solution. The 
question is: Can we do better than this? To check this, let us ponder (1.82) again: 

$$(x_1, x_2, s_1, s_2) = \left(0, \tfrac{78}{5}, 0, \tfrac{102}{5}\right). \tag{1.83}$$


Remember that in the initial feasible point (1.80), $(x_1, x_2) = (0,0)$. On the other 
hand, in the second feasible point (1.83), the variables that are set to zero are different: 
$(x_1, s_1) = (0, 0)$. Notice that whenever an iteration occurs, we have a new set of two 
variables being set to zero. This is because only one variable ($x_2$ in this example) 
is perturbed while the other ($x_1$ in this example) stays at zero, and at the new initial 
point one constraint holds with equality, thus forcing another variable ($s_1$ in this example) 
to be zero. 

[Figure 1.16: the feasible region in the $(x_1, x_2)$ plane, bounded by the lines $3x_1 + 5x_2 = 78$ and $4x_1 + x_2 = 36$.]

Figure 1.16. A geometric insight behind how the simplex algorithm leads to the solution. 
The beauty of the simplex algorithm is that it can obtain the optimal solution although it 
searches only along the edges of the feasible region. 

Given that $(x_1, s_1) = (0,0)$ at the new initial point, one can now think of two 
options for perturbation: (i) $(x_1, s_1) = (\delta, 0)$; and (ii) $(x_1, s_1) = (0, \delta)$. You may 
wonder why the other variables $(x_2, s_2)$ are not taken into consideration for perturbation. 
Sure, we can include such additional options for perturbation. However, the 
beauty of the simplex algorithm is that the restricted perturbation w.r.t. the 
zero-set variables alone suffices for us to obtain the optimal solution. Of course, it is not 
crystal clear why that is the case. The proof of the optimality of the simplex algorithm 
may be helpful for you to obtain a better understanding, although we will not 
cover the proof here. 

In order to check which direction is better between the two options, we first 
need to ponder (1.76) to see how $x_1$ and $s_1$ affect $z$: 

$$z - 5x_1 - 4x_2 = 0. \tag{1.84}$$


But there is a problem here: it is difficult to see how $s_1$ affects 
$z$, since $s_1$ does not appear in equation (1.84). 

A very famous technique, called Gaussian elimination, helps us to see the 
effect of $s_1$ upon $z$. Combining (1.76) and (1.77) properly, we can cancel $x_2$ out to 
obtain: 

$$
\begin{aligned}
& 5 \times [\, z - 5x_1 - 4x_2 = 0 \,] \\
+\; & 4 \times [\, 3x_1 + 5x_2 + s_1 = 78 \,] \\
\Longrightarrow\; & 5z - 13x_1 + 4s_1 = 312.
\end{aligned}
\tag{1.85}
$$


This then immediately rules out the second option: $(x_1, s_1) = (0, \delta)$. Why? We see 
that increasing $s_1$ yields a decrease in $z$. So taking the first option $(x_1, s_1) = (\delta, 0)$ 
is the right way to go. Now the question is: How large can we set $\delta$? To check 
this, let us ponder the constraints (1.77) and (1.78) again: 

\begin{align}
3x_1 + 5x_2 + s_1 &= 78; \tag{1.86} \\
4x_1 + x_2 + s_2 &= 36. \tag{1.87}
\end{align}


Here (1.86) looks okay because we can immediately see how $x_2$ changes depending 
on $(x_1, s_1)$, and this helps us to easily identify the limit of $\delta$. On the other hand, 
the form of (1.87) is not desirable because it does not allow us to directly 
see how $s_2$ changes depending on $(x_1, s_1)$. Hence, it is not that simple to identify 
the limit of $\delta$. Actually the following form is preferred instead: $?x_1 + ?s_1 + ?s_2 = ?$. 
Again Gaussian elimination helps us to obtain the form: 

$$
\begin{aligned}
& 5 \times [\, 4x_1 + x_2 + s_2 = 36 \,] \\
-\; & [\, 3x_1 + 5x_2 + s_1 = 78 \,] \\
\Longrightarrow\; & 17x_1 - s_1 + 5s_2 = 102.
\end{aligned}
\tag{1.88}
$$
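The two elimination steps that produce (1.85) and (1.88) are plain row combinations, so they are easy to double-check numerically. The following is a minimal sketch; the row layout (columns $z, x_1, x_2, s_1, s_2$, RHS) and the helper `combine` are our own illustrative choices.

```python
def combine(alpha, row_a, beta, row_b):
    """Return alpha * row_a + beta * row_b, entry by entry."""
    return [alpha * a + beta * b for a, b in zip(row_a, row_b)]


# Column order: z, x1, x2, s1, s2, RHS
row_176 = [1, -5, -4, 0, 0, 0]    # z - 5*x1 - 4*x2 = 0
row_177 = [0, 3, 5, 1, 0, 78]     # 3*x1 + 5*x2 + s1 = 78
row_178 = [0, 4, 1, 0, 1, 36]     # 4*x1 + x2 + s2 = 36

row_185 = combine(5, row_176, 4, row_177)    # cancels x2 between (1.76) and (1.77)
row_188 = combine(5, row_178, -1, row_177)   # cancels x2 between (1.78) and (1.77)

print(row_185)  # [5, -13, 0, 4, 0, 312]  ->  5z - 13*x1 + 4*s1 = 312
print(row_188)  # [0, 17, 0, -1, 5, 102]  ->  17*x1 - s1 + 5*s2 = 102
```

In both results the $x_2$ entry is zero, which is exactly what lets us read off the effect of the zero-set variables $(x_1, s_1)$.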


With (1.88) and (1.85), we can re-write the optimization problem as: 

\begin{align}
\max\; & z: \tag{1.89} \\
& 5z - 13x_1 + 4s_1 = 312 \tag{1.90} \\
& 3x_1 + 5x_2 + s_1 = 78 \tag{1.91} \\
& 17x_1 - s_1 + 5s_2 = 102 \tag{1.92} \\
& x_1, x_2, s_1, s_2 \ge 0. \tag{1.93}
\end{align}

Remember that our second feasible point was: 

$$(x_1, x_2, s_1, s_2) = \left(0, \tfrac{78}{5}, 0, \tfrac{102}{5}\right). \tag{1.94}$$


As mentioned earlier, we can rule out the second option $(x_1, s_1) = (0, \delta)$. So taking 
the first option $(x_1, s_1) = (\delta, 0)$, (1.91) and (1.92) give: $5x_2 = 78 - 3\delta \ge 0$ and $5s_2 = 102 - 17\delta \ge 
0$, which then yields: $\delta \le \min\{26, \frac{102}{17}\} = \frac{102}{17} = 6$. Plugging $(x_1, s_1) = (6, 0)$ into (1.90), we obtain: $z = 78$. Since 
$z = 78$ is strictly larger than $z = 62.4$ (obtained under (1.94)), this motivates us 
to choose the following feasible point: 

$$(x_1, x_2, s_1, s_2) = \left(\tfrac{102}{17}, \tfrac{204}{17}, 0, 0\right) = (6, 12, 0, 0) \tag{1.95}$$


where $(x_2, s_2)$ are set according to (1.91) and (1.92). The geometric picture for this 
is illustrated in Fig. 1.16. We move from $(x_1, x_2) = (0, \frac{78}{5})$ to $(x_1, x_2) = (\frac{102}{17}, \frac{204}{17})$. 

Iteration 3 Can we do better? To check this, again ponder (1.95). We now 
have the following two options for perturbation: (i) $(s_1, s_2) = (\delta, 0)$; and (ii) 
$(s_1, s_2) = (0, \delta)$, due to $(s_1, s_2) = (0,0)$. To check which direction is better, again 
consider (1.90) to see how $(s_1, s_2)$ affect $z$: $5z - 13x_1 + 4s_1 = 312$. Here it is 
difficult to see how $s_2$ affects $z$. Again use Gaussian elimination to obtain the 
following, where one can see the effect immediately: 

$$
\begin{aligned}
& 17 \times [\, 5z - 13x_1 + 4s_1 = 312 \,] \\
+\; & 13 \times [\, 17x_1 - s_1 + 5s_2 = 102 \,] \\
\Longrightarrow\; & 85z + 55s_1 + 65s_2 = 6630.
\end{aligned}
\tag{1.96}
$$

We see from (1.96) that increasing $(s_1, s_2)$ yields a decrease in $z$, meaning that any 
perturbation does not increase $z$ further. Hence, we stop here, obtaining: 

$$(x_1^*, x_2^*) = (6, 12) \;\Longrightarrow\; z^* = 78. \tag{1.97}$$


This is how the simplex algorithm works. We stop when increasing such zero- 
variables does not increase z. It turns out this way of iteration enables us to achieve 
the optimal solution in a finite number of steps. In many practical applications, 
it has been shown that the finite number of steps required is much less than the 
total number of vertices in the polytope formed by the constraints, meaning that 
the simplex algorithm arrives at the optimal point very fast. 
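For a problem this small, one can also check the answer by brute force: list every vertex of the feasible region and evaluate the objective at each. The sketch below is our own sanity check under that assumption, not the simplex algorithm itself; the helper names `intersect` and `is_feasible` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Each entry (a, b) encodes a[0]*x1 + a[1]*x2 <= b; the last two encode x1 >= 0 and x2 >= 0.
constraints = [((3, 5), 78), ((4, 1), 36), ((-1, 0), 0), ((0, -1), 0)]


def intersect(c1, c2):
    """Intersection of two boundary lines via Cramer's rule (None if they are parallel)."""
    (a1, b1), (a2, b2) = c1, c2
    det = a1[0] * a2[1] - a1[1] * a2[0]
    if det == 0:
        return None
    return (Fraction(b1 * a2[1] - a1[1] * b2, det),
            Fraction(a1[0] * b2 - b1 * a2[0], det))


def is_feasible(p):
    return all(a[0] * p[0] + a[1] * p[1] <= b for a, b in constraints)


vertices = set()
for c1, c2 in combinations(constraints, 2):
    p = intersect(c1, c2)
    if p is not None and is_feasible(p):
        vertices.add(p)

best = max(vertices, key=lambda p: 5 * p[0] + 4 * p[1])
print(sorted(vertices))
print("best vertex:", best, "z =", 5 * best[0] + 4 * best[1])
```

The region has four vertices, $(0,0)$, $(9,0)$, $(6,12)$ and $(0, 78/5)$, and the walk described above visited only three of them, $(0,0) \to (0, 78/5) \to (6,12)$, before stopping at the best one.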


Look ahead In the next section, we will study how to solve LPs using a particular 
yet useful programming tool: CVXPY. We will then move onto the second instance 
of convex optimization problems: Least Squares. 




1.7 LP: CVXPY Implementation 


Recap So far we have studied several topics: (1) the concept of convex optimization; 
(2) typical categorization of convex instances; (3) why convex optimization 
problems are tractable (efficiently solvable on a computer); (4) a number of historical 
and classical examples that can be translated into LPs or that can be solved via LP 
relaxation; and (5) the simplex algorithm for LPs. However, there is one topic 
that we have not yet covered for LP: how to solve LPs with programming 
tools such as CVXPY, which runs on a widely used open-source platform, Python. 


Outline In this section, we are going to cover the missing piece: studying how to 
implement several problems of interest via CVXPY. Specifically, we are going 
to do four things. First off, we will learn how to install CVXPY in Python. We 
will then investigate basic CVXPY syntax via a simple example. Next we will do 
some exercises in code implementation to get familiar with CVXPY. We will do 
this in the context of the two prior examples: (i) Kantorovich’s plywood cutting 
problem (see (1.47) in Section 1.4); and (ii) the toy example introduced in the 
course of explaining the simplex algorithm (see (1.74) in Section 1.6). 


CVXPY in Python There are two popular software tools, depending on the platform 
that we use: (1) CVX (running in MATLAB); (2) CVXPY (running in Python). 
While MATLAB is much more user-friendly and hence much easier to use, it 
requires a license. So we will use free software, CVXPY. 

CVXPY is a library running in a recent version of Python, Version 3. 
So you should first install Version 3 by downloading it from 
https://www.python.org/downloads/. Or you can use it via virtual environment tools like 
Anaconda: 


https://www.anaconda.com/products/individual 


For more details, refer to a Python tutorial in Appendix A. 
In order to install CVXPY, you should rely upon the library manager called pip. 
Here is a command for installation: 


pip install cvxpy 


To check the list of installed libraries, you can type: 
pip list 


If you have difficulties during installation, please refer to 


https://www.cvxpy.org/install/index.html 


How to use CVXPY? To give you a rough idea as to how to use CVXPY, let us 
give you a CVXPY script for a simple example: 

$$
\begin{aligned}
\min_{x}\; & (x_1 - 3x_2)^2: \\
& x_1 - x_2 \le 3, \\
& x_1 + x_2 = 10.
\end{aligned}
\tag{1.98}
$$

With some straightforward yet tedious calculation, you can figure out that the 
optimal solution is: $(x_1^*, x_2^*) = (6.5, 3.5)$. Let’s do a sanity check with the following 
CVXPY script: 


import cvxpy as cp

# optimization variables
x1 = cp.Variable()
x2 = cp.Variable()

# constraints
constraints = [x1 - x2 <= 3, x1 + x2 == 10]

# objective function
obj_min = cp.Minimize((x1 - 3*x2)**2)

# set up a problem
prob = cp.Problem(obj_min, constraints)

# solve the problem
prob.solve()

# print the solution
print('status: ', prob.status)
print('optimal value: ', prob.value)
print('optimal variables: ', x1.value, x2.value)


Here the empty parentheses in cp.Variable() generate a scalar optimization 
variable. To create a vector of a higher dimension, say d, one can put the 
dimension inside the parentheses like cp.Variable(d). Sometimes, an optimization 
variable is of a matrix form, in particular in the case of Semi-Definite Program (SDP) 
that we will study in Section 1.13. In such a case, we pass the shape as a tuple: cp.Variable((d1,d2)) where 
(d1,d2) are the row and column sizes of the matrix of interest. For equality constraints, 
we use the symbol “==” instead of “=”. The command cp.Minimize is for 
minimization, while cp.Maximize is employed for maximization. The 
command cp.Problem creates an optimization problem object that takes the objective 
function and constraints as input arguments. The command prob.solve() 
solves the optimization problem and returns the optimal value. The last 
three lines are for checking whether the solution is optimal as well as printing 
the optimal value together with the optimal solution. In this case, we should 
read: 


status: optimal 
optimal value: 16.0 
optimal variables: 6.5 3.5 


Notice that the optimal solution is the same as what we calculated by hand. 
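For completeness, here is one way the hand calculation for (1.98) can go. It is a sketch of our own (substituting the equality constraint and then checking which constraint is active), not necessarily the route the reader is expected to take.

```latex
\begin{aligned}
x_1 + x_2 = 10 \;&\Longrightarrow\; x_1 = 10 - x_2, \quad (x_1 - 3x_2)^2 = (10 - 4x_2)^2, \\
x_1 - x_2 \le 3 \;&\Longrightarrow\; 10 - 2x_2 \le 3 \;\Longrightarrow\; x_2 \ge 3.5, \\
(10 - 4x_2)^2 \text{ is increasing for } x_2 \ge 2.5 \;&\Longrightarrow\; x_2^* = 3.5, \quad x_1^* = 10 - 3.5 = 6.5, \\
(x_1^* - 3x_2^*)^2 &= (6.5 - 10.5)^2 = 16.
\end{aligned}
```

The unconstrained minimizer $x_2 = 2.5$ violates $x_2 \ge 3.5$, so the inequality constraint is active at the optimum, and the value 16 agrees with the script's output.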

You may now have a very rough idea as to how to use CVXPY. Of course, 
this may not be enough for scripting some problems that you may be interested in. 
So we do more exercises, in the context of the two examples that we investigated 
earlier. 


Exercise 1: Kantorovich’s plywood cutting problem (1.47) Recall the 
problem: 

$$
\begin{aligned}
\min\; & x_3: \\
& 0 \le x_1 \le 1, \quad 0 \le x_2 \le 1, \\
& -10x_1 - 20x_2 - x_3 \le 0, \\
& 10x_1 + 40x_2 - x_3 - 50 \le 0.
\end{aligned}
\tag{1.99}
$$


Using the syntax that we learned above, we can readily write the code as 
below: 


import cvxpy as cp

x1 = cp.Variable()
x2 = cp.Variable()
x3 = cp.Variable()

constraints = [x1 >= 0, x1 <= 1,
               x2 >= 0, x2 <= 1,
               -10*x1 - 20*x2 - x3 <= 0,
               10*x1 + 40*x2 - x3 - 50 <= 0]
obj_min = cp.Minimize(x3)
prob = cp.Problem(obj_min, constraints)
prob.solve()
print('status: ', prob.status)
print('optimal value: ', prob.value)
print('optimal variables: ', x1.value, x2.value, x3.value)


status: optimal 
optimal value: -20.00000000023447 
optimal variables: 1.0000000001581 0.4999999998885342 -20.00000000023447 

Actually, the optimal value is $-20$ and the optimal variables should read 
$(x_1^*, x_2^*, x_3^*) = (1, 0.5, -20)$. So there are some minor discrepancies between these 
exact values and the above numerical solution. This is because CVXPY hands the problem to 
a numerical solver that runs an iterative algorithm and stops once a preset tolerance is met, 
so the reported values match the exact ones only up to small numerical errors. 
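When comparing solver output with a hand-derived answer, it is therefore safer to compare up to a tolerance than to test for exact equality. A minimal sketch follows; the numbers are copied from the output above, while the tolerance `1e-6` and the rounding to six decimals are arbitrary choices of ours.

```python
import math

# Values printed by the script above, and the exact solution derived by hand.
solver_output = (1.0000000001581, 0.4999999998885342, -20.00000000023447)
exact = (1.0, 0.5, -20.0)

# Exact equality fails, but the values agree within a small absolute tolerance.
exactly_equal = [v == e for v, e in zip(solver_output, exact)]
close_enough = [math.isclose(v, e, rel_tol=0.0, abs_tol=1e-6)
                for v, e in zip(solver_output, exact)]
rounded = [round(v, 6) for v in solver_output]

print(exactly_equal)  # [False, False, False]
print(close_enough)   # [True, True, True]
print(rounded)        # [1.0, 0.5, -20.0]
```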


Exercise 2: The toy example used for the simplex algorithm Recall the 
toy example: 

$$
\begin{aligned}
\max\; & 5x_1 + 4x_2: \\
& 3x_1 + 5x_2 \le 78, \\
& 4x_1 + x_2 \le 36, \\
& x_1, x_2 \ge 0.
\end{aligned}
\tag{1.100}
$$

The code implementation is also very simple. See below. 


import cvxpy as cp

x1 = cp.Variable()
x2 = cp.Variable()
constraints = [3*x1 + 5*x2 <= 78,
               4*x1 + x2 <= 36,
               x1 >= 0, x2 >= 0]
obj_max = cp.Maximize(5*x1 + 4*x2)
prob = cp.Problem(obj_max, constraints)
prob.solve()
print('status: ', prob.status)
print('optimal value: ', prob.value)
print('optimal variables: ', x1.value, x2.value)


status: optimal 
optimal value: 77.99999988422522 
optimal variables: 5.999999990155606 11.999999983361798 


Similarly, we see some very minor numerical differences, compared to the optimal 
value 78 and the optimal variables $(x_1^*, x_2^*) = (6, 12)$. 

You may want to practice more. Don't worry. We have prepared a couple of 
exercise problems in the upcoming problem set, so you will have some chances to do 
more. 


Look ahead We are now done with all the contents on LP. In the next section, we 
are going to move on to the second instance of convex optimization: Least Squares 
(LS). Specifically, what we are going to do for LS is threefold. First we will review 
what the LS problem is. We will then present a geometric insight which can help us 
to understand what the LS solution means and therefore why the solution makes 
sense in light of our intuition. Lastly we will study one very important application 
in machine learning: the classification problem. 

Problem Set 2 


Prob 2.1 (Total probability law) Consider discrete random variables $X \in \mathcal{X}$ 
and $Y \in \mathcal{Y}$ with probability distributions $P_X$ and $P_Y$, respectively. 


(a) State the total probability law, and prove it. 
(b) Suppose $P_{X,Y}$ is a joint distribution for $(X, Y)$. Show that 

$$\sum_{y \in \mathcal{Y}} P_{X,Y}(x, y) = P_X(x) \quad \forall x \in \mathcal{X}. \tag{1.101}$$

(c) Suppose (1.101) holds. Show that $P_{X,Y}$ satisfies one of the probability axioms: 

$$\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P_{X,Y}(x, y) = 1.$$


Prob 2.2 (1st-order Wasserstein distance) Let $X \in \mathcal{X} := \{-2, 0, 2\}$ be a 
discrete random variable with probability distribution: P(x) = i L, J for x = 
$-2, 0, 2$, respectively. Let $Y \in \mathcal{Y} := \{-4, -1, 1, 4\}$ be another discrete random 
variable with: Py(y) = 2, $, ae for y = —4, —1, 1,4, respectively. Consider 
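Monge’s problem which can be formulated as follows. Given $P_X$ and $P_Y$, 

$$W(P_X, P_Y) := \min_{P_{X,Y}} \mathbb{E}\left[\|X - Y\|\right] \tag{1.102}$$

where the minimization is over all joint distributions $P_{X,Y}$ respecting the marginals 
$P_X$ and $P_Y$: 

\begin{align}
\sum_{y \in \mathcal{Y}} P_{X,Y}(x, y) &= P_X(x) \quad \forall x \in \mathcal{X}; \tag{1.103} \\
\sum_{x \in \mathcal{X}} P_{X,Y}(x, y) &= P_Y(y) \quad \forall y \in \mathcal{Y}. \tag{1.104}
\end{align}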


(a) Translate the above optimization problem into an LP in standard form. 
(b) Solve the optimization problem using CVXPY. Also write a script for 
CVXPY implementation. 


Prob 2.3 (Linear classification) Consider the linear classification problem 
wherein the goal is to find a boundary in the form of a line that can distinguish legitimate 
emails from spam. We are given a training dataset $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$ placed in 
the file “train.csv”. The file is uploaded on 


http://csuh.kaist.ac.kr/convex_book/PS2/train.csv 

Here $x^{(i)} := (x_1^{(i)}, x_2^{(i)})$ indicates the $i$th example and $y^{(i)}$ denotes its corresponding 
label (legitimate = $+1$; spam = $-1$). 
(a) What are $m$ and the dimension of $x^{(i)}$? 
(b) Use Python to visualize the data points in the two-dimensional space. 
Hint: You may want to use the Python package matplotlib.pyplot. For how to 
use it, refer to Appendix A.2. 
(c) Formulate an optimization problem for the linear classifier. Solve the problem 
using CVXPY. Also write a script for CVXPY implementation. 


Prob 2.4 (Bipartite matching problem) Consider the bipartite matching 
problem that assigns three people to three tasks in a one-to-one fashion. The 
matching costs are given as: 


(w11, w12, w13) = (1, 2, 4) 
(w21, w22, w23) = (1, 4, 2) 
(w31, w32, w33) = (1, 3, 3) 


where $w_{ij}$ indicates the cost w.r.t. the assignment of task $j$ to person $i$. The objective 
function is: 

$$\min \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} x_{ij} \tag{1.105}$$

where $x_{ij} \in \{0,1\}$ indicates whether person $i$ is assigned to task $j$ and $N = 3$ is 
the number of people (or tasks). Here the constraint is that each person must be 
assigned, and so must each task: 

\begin{align}
\sum_{j=1}^{N} x_{ij} &= 1 \quad \forall i \in \{1, \ldots, N\} \quad \text{(each person must be assigned to one task)}; \tag{1.106} \\
\sum_{i=1}^{N} x_{ij} &= 1 \quad \forall j \in \{1, \ldots, N\} \quad \text{(each task must be assigned to one person)}. \tag{1.107}
\end{align}


(a) Formulate a Boolean optimization problem. Derive the optimal value $p^*$ 
and the optimal solution $x^*$. You can do this by hand or by computer. 

(b) Formulate an LP relaxation problem. Solve this problem (deriving $p^*_{\mathrm{LP}}$ 
and $x^*_{\mathrm{LP}}$) using CVXPY. Also write a script for CVXPY implementation. Is 
$p^*_{\mathrm{LP}} = p^*$? 


Prob 2.5 (Standard and slack forms for the simplex algorithm) Consider 
the following LP: 

$$
\begin{aligned}
\min\; & x_1 + x_2: \\
& 2x_1 + x_2 \ge 1, \\
& x_1 + 3x_2 \ge 1, \\
& x_1 \ge 0, \; x_2 \ge 0.
\end{aligned}
$$

(a) Convert this into the standard form for the simplex algorithm. 
(b) Convert the standard form (derived in part (a)) into the slack form. 


Prob 2.6 (Simplex algorithm: z-maximizing direction) Consider the example 
that we investigated in Section 1.6: 

\begin{align}
\max\; & z: \tag{1.108} \\
& z - 5x_1 - 4x_2 = 0 \tag{1.109} \\
& 3x_1 + 5x_2 + s_1 = 78 \tag{1.110} \\
& 4x_1 + x_2 + s_2 = 36 \tag{1.111} \\
& x_1, x_2, s_1, s_2 \ge 0. \tag{1.112}
\end{align}

Suppose we set an initial feasible point as: 

$$(x_1, x_2, s_1, s_2) = (0, 0, 78, 36). \tag{1.113}$$

Consider three possible options for a direction along which we perturb the initial 
point: 


(i) $(x_1, x_2) = (\delta, 0)$; 

(ii) $(x_1, x_2) = (0, \delta)$; 

(iii) $(x_1, x_2) = (\delta_1, \delta_2)$ 
where $\delta, \delta_1, \delta_2 > 0$. 


(a) Suppose that $\delta = \delta_1 + \delta_2 = 1$. Which is the $z$-maximizing direction? 

(b) Suppose that $(\delta, \delta_1, \delta_2)$ can be chosen arbitrarily subject to the constraints. 
Is $(\delta_1, \delta_2) = (\frac{102}{17}, \frac{204}{17})$ a valid choice, i.e., respecting the constraints? If so, 
what is $z$ under the choice? What is the $z$-maximizing direction in this case? 

(c) When taking the third option, discuss whether it is easy to find the best 
choice of $(\delta_1, \delta_2)$ (in the sense of maximizing $z$) relative to finding the best $\delta$ 
in the first (or second) option. What if we have more constraints that come 
with more slack variables? 


Prob 2.7 (Simplex algorithm: Exercise 1) Consider the following LP: 

$$
\begin{aligned}
\max\; & 4x_1 + 3x_2: \\
& 2x_1 + 3x_2 \le 6, \\
& -3x_1 + 2x_2 \le 3, \\
& 2x_2 \le 5, \\
& 2x_1 + x_2 \le 4, \\
& x_1 \ge 0, \; x_2 \ge 0.
\end{aligned}
$$

(a) Solve this using the simplex algorithm that we learned in Section 1.6. You 
should do it by hand. 
(b) Solve this using CVXPY as a sanity check for your solution in part (a). 


Prob 2.8 (Simplex algorithm: Exercise 2) Consider the following LP: 

$$
\begin{aligned}
\min\; & 3x_1 + 9x_2: \\
& 2x_1 + x_2 \ge 8; \\
& x_1 - 2x_2 \ge -1; \\
& x_1 \ge 0, \; x_2 \ge 0.
\end{aligned}
\tag{1.114}
$$

(a) Convert the problem into the slack form. 

(b) Use the simplex algorithm to solve this problem. Show all the detailed procedures 
as per the rule that we learned in Section 1.6. 

(c) Write a CVXPY script for solving the above optimization (1.114). 


Prob 2.9 (True or False?) 


(a) Consider Monge’s problem that we studied in Section 1.4. Let $s_i \ge 0$ be 
the amount of soil mined in ground $i \in \{1, 2, \ldots, m\}$. Let $d_j \ge 0$ be 
the amount of soil demanded at construction site $j \in \{1, 2, \ldots, n\}$. In 
Section 1.4, assuming that $\sum_{i=1}^{m} s_i = \sum_{j=1}^{n} d_j = 1$, it was shown that the 
problem can be formulated as an LP. In fact, without the assumption on the $s_i$’s 
and $d_j$’s, the problem can also be formulated as an LP. 

(b) The LP relaxation technique yields the exact solution for the shortest path 
problem. 

1.8 Least Squares (LS) 


Recap So far we have studied many topics, ranging from the concept of convex 
optimization to all the details for LP including CVXPY implementation. 


Outline In this section, we will move on to the second instance of convex optimization: 
Least Squares (LS). Specifically we will cover three topics. First we will 
review what the LS problem is. We will then present a geometric insight which 
helps us to understand what the LS solution means and therefore why the solution 
makes sense in light of our intuition. Lastly we will study one very important 
application in machine learning: the classification problem. 


Review: Least-squares problem Since we studied LS in Section 1.1 (a bit far 
from here), let us start by reviewing what the problem is. The problem is formulated as: 

$$\min \|Ax - b\|^2. \tag{1.115}$$

As mentioned earlier, one of the most important benefits of 
this problem is that it has a closed-form solution: 

$$x^* = (A^T A)^{-1} A^T b. \tag{1.116}$$

In Section 1.1, we just claimed that this is the solution. Now let us prove it. The 
first thing to notice is that the objective function $\|Ax - b\|^2$ is convex. Actually this 
was dealt with in Prob 1.1(b). Next, remember the optimality condition for $x^*$ that we 
learned in Section 1.3 w.r.t. unconstrained optimization. That is, $x^*$ must be a 
stationary point. By applying this, we obtain: 

$$\nabla \|Ax^* - b\|^2 = 0. \tag{1.117}$$

Now consider the gradient w.r.t. $x$: 

$$
\begin{aligned}
\nabla \|Ax - b\|^2 &= \nabla (Ax - b)^T (Ax - b) \\
&= \nabla \left( x^T A^T A x - 2x^T A^T b + b^T b \right) \\
&= 2A^T A x - 2A^T b
\end{aligned}
$$

where the last equality follows from the definition of the gradient w.r.t. a vector. Please 
work through Prob 1.2 if you are not convinced by the gradient computation. Applying 
this to the optimality condition (1.117) gives $A^T A x^* = A^T b$ and hence, when $A^T A$ is invertible: 

$$x^* = (A^T A)^{-1} A^T b. \tag{1.118}$$
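The closed form (1.118) is easy to check numerically. Below is a minimal sketch using NumPy; the random matrix, its size, and the seed are arbitrary choices made here for illustration, and `np.linalg.lstsq` is used only as an independent reference.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))   # a tall matrix: m = 6 equations, d = 3 unknowns
b = rng.standard_normal(6)

# Closed-form solution: solve the normal equations A^T A x = A^T b.
x_star = np.linalg.solve(A.T @ A, A.T @ b)

# The gradient 2 A^T A x - 2 A^T b should vanish at x_star.
gradient = 2 * A.T @ A @ x_star - 2 * A.T @ b

# Independent reference: NumPy's least-squares routine.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print("gradient at x*:", gradient)
print("matches lstsq:", np.allclose(x_star, x_lstsq))
```

Solving the linear system with `np.linalg.solve` rather than forming the inverse explicitly is a common numerical preference; mathematically it gives the same $x^*$.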
Dimensions of $(x, A, b)$ Let $d$ be the dimension of the optimization variable $x$. 
Then, $x \in \mathbb{R}^d$. Let $A := [a_1 \cdots a_d] \in \mathbb{R}^{m \times d}$. Then, $b \in \mathbb{R}^m$. We now have two 
cases depending on the values of $m$ and $d$. 

One is: $m < d$, i.e., $A$ is a wide matrix.⁴ Suppose all the row vectors in $A$ are 
linearly independent, i.e., $\mathrm{rank}(A) = m$, which usually holds in practice. In this 
case, we have a larger number $d$ of unknowns than the number $m$ of linear equations 
in $Ax - b = 0$. This implies that there are infinitely many solutions that satisfy 
$Ax - b = 0$. So in this case, the optimal value is $p^* = 0$. In typical scenarios, the 
solution $x^*$ that we seek to find is unique. For instance, consider the 
astronomy problem that we investigated in Section 1.1, wherein $x^*$ concerns the 
ground-truth trajectory of the orbit of Ceres. In such a case, $x^*$ must be unique. 
However, if $m < d$, there are infinitely many solutions for $x^*$. Then, it would be 
very unlikely that one solution among infinitely many coincides with the 
ground-truth trajectory. This is definitely not a scenario of interest. 

The second case is: $m > d$, i.e., $A$ is a tall matrix. Suppose that $b$ does not lie in 
the range space of $A$, $\mathrm{range}(A)$ (the space spanned by all the column vectors of $A$). 
Actually the case $b \notin \mathrm{range}(A)$ is typical in practice. Again think about the astronomy 
problem in which $b$ indicates a collection of observed location coordinates 
and $A$ is an observed-location-dependent matrix. Since observations often contain 
measurement errors in random directions, it is very likely that $b$ does 
not lie in a particular subspace such as $\mathrm{range}(A)$. Hence, $b \notin \mathrm{range}(A)$. In the case of $m > d$, 
there is then no solution that satisfies $Ax - b = 0$. But what we can say is that 
there is a solution that minimizes $\|Ax - b\|^2$, and this forms the basic idea 
behind the least-squares problem. So what we are interested in is the second case: 
$m > d$. 
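The contrast between the two cases can be seen numerically. The sketch below is our own illustration with randomly generated data (the sizes and seed are arbitrary); it relies on `np.linalg.lstsq`, which returns a minimizer of $\|Ax - b\|^2$ in both cases (the minimum-norm one when $A$ is wide).

```python
import numpy as np

rng = np.random.default_rng(1)


def ls_residual(m, d):
    """Minimum of ||Ax - b||^2 for a random m-by-d matrix A and random b."""
    A = rng.standard_normal((m, d))
    b = rng.standard_normal(m)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(np.linalg.norm(A @ x - b) ** 2)


wide = ls_residual(m=3, d=5)   # fewer equations than unknowns: Ax = b is solvable
tall = ls_residual(m=5, d=3)   # more equations than unknowns: b is typically outside range(A)

print("wide case, minimum of ||Ax - b||^2:", wide)   # numerically zero
print("tall case, minimum of ||Ax - b||^2:", tall)   # strictly positive
```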


Geometric insight We present a geometric insight behind the least-squares 
problem. From this, you will see what the least-squares solution means, as well 
as why the problem is important accordingly. 

Let us first consider the simplest setting in which $d = 1$. In this case, $A$ is simply 
a single-column vector and $x$ is a scalar. Suppose we have two vectors $a_1$ and $b$, 
as illustrated in Fig. 1.17(a), in which $b$ is not aligned with $a_1$. This is due to the 
assumption that often holds in practice: $b \notin \mathrm{range}(A)$. Notice that $\|a_1 x - b\|^2$ is 
minimized when the vector $a_1 x - b$ (marked in the blue thick line) is perpendicular 
to the direction of $a_1$. So from this, one can interpret the least-squares solution 
as the distance-minimizing solution. The distance-minimizing solution is obviously 
what we want. So it is a good solution which matches our natural 
demand well. 

4. In fact, a majority of people use the terminology like a fat matrix instead. But Prof. Stephen Boyd at Stanford 
whom I interacted with while being on sabbatical recommended the use of a different terminology: a wide 
matrix. His rationale was that the wide matrix has sort of positive nuance, while the fat matrix looks negative. 
In light of his positive attitude, let us use the “wide matrix” terminology. 

Figure 1.17. Geometric insight behind the least-squares solutions. 

We can have the same interpretation for a slightly more general case, say $d = 2$. 
In this case, $A = [a_1 \; a_2] \in \mathbb{R}^{m \times 2}$. The vector $Ax$ now lies in the plane, which 
is the range space of $A$: $Ax \in \mathrm{range}(A)$; see Fig. 1.17(b). Similarly $\|Ax - b\|^2$ is 
minimized when the vector $Ax - b$ (marked in the blue thick line) is perpendicular 
to the plane, as illustrated in Fig. 1.17(b). So again one can interpret $Ax^*$ as the 
distance-minimizing solution. 
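The perpendicularity claim can be verified on a tiny example with $d = 2$. The matrix and vector below are made-up numbers chosen so that the arithmetic is easy to follow; they are not taken from the text.

```python
import numpy as np

# Hypothetical small instance: m = 3, d = 2, with b outside the plane range(A).
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0, 0.0])

x_star = np.linalg.solve(A.T @ A, A.T @ b)   # least-squares solution
residual = A @ x_star - b                    # the vector A x* - b

# Perpendicular to the plane means orthogonal to every column of A.
inner_products = A.T @ residual

print("x* =", x_star)                  # [0. 1.]
print("A x* - b =", residual)          # [-1. -1.  1.]
print("A^T (A x* - b) =", inner_products)
```

Note that $A^T(Ax^* - b) = 0$ is the same statement as the stationarity condition $2A^TAx^* - 2A^Tb = 0$, so the geometric picture and the gradient computation agree.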


An application: Classification problem As mentioned earlier, the least-squares 
problem is a very popular and powerful problem which has played a significant 
role in the optimization field since the birth of the problem in the 1800s. 
It has been employed for addressing many important problems that arise in a wide 
variety of applications. 

In this section, we would like to put a particular emphasis on one important 
application that arises in machine learning and that we have already investigated 
earlier: the classification problem. 

Remember the classification example that we studied in Section 1.5: legitimate-vs-spam 
email classification, in which we are given $m$ data points $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$. 
Here $x^{(i)}$ indicates a feature vector. In the example, we considered a two-dimensional 
case where $x^{(i)} := (x_1^{(i)}, x_2^{(i)})$ and $(x_1^{(i)}, x_2^{(i)})$ denote the frequencies of keywords 1 
and 2 that appear in the $i$th email, respectively. Here $y^{(i)}$ is a label, indicating the 
identity of the $i$th email: $y^{(i)} = +1$ (legitimate email), $y^{(i)} = -1$ (spam email). 

For the above setting, we considered linear classifiers. For the separable case, we 
formulated an LP which intends to find a line that separates two datasets (legitimate 
vs. spam). For the non-separable case (which is typical in practice), we formulated 
a slightly different LP which finds a line that minimizes the aggregated 
margin. 

In this section, we will consider a different classifier which is based on the least-squares 
problem and is therefore called the least-squares classifier. The idea of the 
least-squares classifier is to find a linear projector that minimizes the aggregated 
squared error. 

Figure 1.18. A block diagram of the least-squares classifier. 


Least-squares classifier To see what the idea means, let us consider a block dia- 
gram for the classifier, illustrated in Fig. 1.18. The least-squares classifier is param- 
eterized by a weight vector, say w € R7. Given input x”, it computes a linear 
projection w.r.t. the weight vector w; hence, it outputs x w. You may want to 
consider a sightly more general setting where we allow for having a bias term, like 
xT w+ b. It turns out one can deal with this case easily with a slight modification 
to the classifier. This will be explored later. The way to design w is as follows. Using 
the corresponding label y®, we first compute its squared error: ||x®Tw — y|[?. 
Next compute the aggregated squared error with all of the m data points given. 


Finally we formulate an optimization problem which minimizes the aggregation: 
min >) xT w = y|??. (1.119) 
weR4 j=l 


Notice that the objective function is very similar to the one that we saw in Sec- 
tion 1.1. Yes, that is the objective function that Gauss came up with in the process 
of addressing the astronomy problem. So we can use the same simplification trick 
that Gauss did, thus obtaining: 


; x DT y — yD 7 II? 
DOF w — yO? = 
i=1 xa Ty — yi) 
; (1.120) 
„OT yA) 
= : oe : 
xT (m) 
Define 
x®T yOT 
A:= : , b:= : : (1.121) 
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Figure 1.19. A block diagram of the bias-allowing least-squares classifier. 


Using these notations together with (1.120), we can rewrite the optimization problem (1.119) as:

$$\min_{w} \|Aw - b\|^2, \tag{1.122}$$

which is the least-squares problem. So we obtain $w^*$ as:

$$w^* = (A^T A)^{-1} A^T b. \tag{1.123}$$
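As a minimal sketch of (1.121) to (1.123), the snippet below stacks a few made-up data points into $A$ and $b$ and evaluates the closed form with NumPy. The toy data and the helper name `ls_weights` are assumptions introduced here for illustration only. The normal equations $(A^T A) w = A^T b$ are solved with `np.linalg.solve` instead of forming the inverse explicitly; this gives the same $w^*$ whenever $A^T A$ is invertible.

```python
import numpy as np

def ls_weights(A, b):
    """Solve min_w ||A w - b||^2 via the normal equations (A^T A) w = A^T b."""
    return np.linalg.solve(A.T @ A, A.T @ b)

# Hypothetical toy data: each row of A is one x^(i)T, b holds labels in {+1, -1}.
A = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [-1.0, -1.5],
              [-2.0, -0.5]])
b = np.array([1.0, 1.0, -1.0, -1.0])

w_star = ls_weights(A, b)
print("w* =", w_star)
print("sign of projections:", np.sign(A @ w_star))
```

On this toy data the sign of each projection agrees with its label, which is the behavior one hopes for from a trained classifier.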


Bias-allowing LS classifier Consider a more general setting in which we allow 
for a bias in the LS classifier. See Fig. 1.19. In case a bias term is allowed, the linear 
classifier outputs: 


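$$x^{(i)T} w + c \tag{1.124}$$

where $c \in \mathbb{R}$ indicates the bias. We can then formulate an optimization problem as:

$$\min_{w \in \mathbb{R}^d,\, c \in \mathbb{R}} \sum_{i=1}^{m} \|x^{(i)T} w + c - y^{(i)}\|^2. \tag{1.125}$$

Using exactly the same manipulation that we did in (1.120), we get:

$$\sum_{i=1}^{m} \|x^{(i)T} w + c - y^{(i)}\|^2 = \left\| \begin{bmatrix} x^{(1)T} w + c - y^{(1)} \\ \vdots \\ x^{(m)T} w + c - y^{(m)} \end{bmatrix} \right\|^2 = \left\| \begin{bmatrix} x^{(1)T} & 1 \\ \vdots & \vdots \\ x^{(m)T} & 1 \end{bmatrix} \begin{bmatrix} w \\ c \end{bmatrix} - \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix} \right\|^2. \tag{1.126}$$

By defining

$$A := \begin{bmatrix} x^{(1)T} & 1 \\ \vdots & \vdots \\ x^{(m)T} & 1 \end{bmatrix}, \qquad b := \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix},$$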
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we see that the optimal solution is exactly of the same formula as before: 


$$\begin{bmatrix} w^* \\ c^* \end{bmatrix} = (A^T A)^{-1} A^T b. \tag{1.127}$$
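The bias-allowing case can be sketched in the same way: append an all-ones column to the data matrix and read the last entry of the stacked solution as $c^*$. The helper name `ls_weights_with_bias` and the toy data are assumptions made for this illustration; the labels are generated from a known affine rule (rather than being $\pm 1$) purely so that the recovered $(w^*, c^*)$ can be checked by eye.

```python
import numpy as np

def ls_weights_with_bias(X, y):
    """Return (w*, c*) minimizing sum_i (x^(i)T w + c - y^(i))^2."""
    m = X.shape[0]
    A = np.hstack([X, np.ones((m, 1))])        # append the all-ones column
    sol = np.linalg.solve(A.T @ A, A.T @ y)    # stacked solution [w*; c*]
    return sol[:-1], sol[-1]

# Hypothetical toy data whose targets follow y = 2*x1 - x2 + 3 exactly.
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 3.0

w_star, c_star = ls_weights_with_bias(X, y)
print("w* =", w_star, " c* =", c_star)
```

Because the targets are exactly affine in the inputs, the least-squares fit recovers the weights (2, -1) and the bias 3 up to floating-point error.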

Question One natural question arises. Can the least-squares classifier perform
better than the linear classifier that we developed earlier using LP? To answer this
question, first of all, we need to know what a proper performance measure is.


Look ahead In the next section, we will cover this topic in depth, thus address- 
ing the question. Specifically we will introduce a prominent performance mea- 
sure named test error. We will then study how to evaluate test error, thereby mak- 
ing a comparison between the LS classifier and the linear classifier. We will also 
study a popular technique employed for improving the test error performance: 
regularization. 
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1.9 LS: Test Error, Regularization and CVXPY 
Implementation 


Recap In the prior section, we embarked on the second instance of convex optimization:
Least Squares (LS). We studied the LS problem in the context of the spam filter design.
We showed that the spam filter design problem can be formulated as an LS optimization.
The formulated LS problem reads: Given data points $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$,

$$\min_{w} \|Aw - b\|^2 \tag{1.128}$$

where

$$A := \begin{bmatrix} x^{(1)T} \\ \vdots \\ x^{(m)T} \end{bmatrix}, \qquad b := \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix}.$$

This then yields $w^* = (A^T A)^{-1} A^T b$. We also proved the closed-form solution
using the optimality condition of unconstrained convex optimization (derived in
Section 1.3). In addition, we extended into a general setting that allows for a bias
term, say $c \in \mathbb{R}$:


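$$\min_{\bar{w} := (w, c) \in \mathbb{R}^{d+1}} \|A\bar{w} - b\|^2 \tag{1.129}$$

where $A$ now reads:

$$A := \begin{bmatrix} x^{(1)T} & 1 \\ \vdots & \vdots \\ x^{(m)T} & 1 \end{bmatrix} \in \mathbb{R}^{m \times (d+1)}. \tag{1.130}$$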


At the end of the last section, we raised two questions: (i) Is the LS classifier bet- 
ter than the margin-based linear classifier?; and (ii) What is a proper performance 
measure? 


Outline In this section, we will answer these two questions. Specifically, we will do
four things. First we will introduce a prominent performance measure called test error.
We will then study how to evaluate the test error. With this measure, we will make a
comparison between the LS classifier and the linear classifier. Next, we will study a
useful technique employed for the purpose of improving the test error performance:
regularization. Lastly we will learn how to solve an LS problem via CVXPY.
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Figure 1.20. A block diagram of the least-squares classifier. 


Test data In order to introduce a proper performance measure, we first need to
study one concept on the data that is employed for the purpose of performance
evaluation. In fact, what we are interested in is the performance w.r.t. data which
is never used while training the model of interest. Such data is called unseen data.
A formal name of the unseen data is test data, as the unseen data is employed
for the purpose of testing. In contrast, we call the data used for training train
data. Train data is usually denoted by $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$, while test data is
denoted by $\{(x_{\text{test}}^{(i)}, y_{\text{test}}^{(i)})\}_{i=1}^{m_{\text{test}}}$, where $m_{\text{test}}$ indicates
the number of examples in the test dataset.


Test error The test error indicates the error computed w.r.t. such unseen data. To
give you a mathematical definition, consider the LS classifier illustrated in Fig. 1.20.
Here an unseen test example, say $x_{\text{test}}$, is fed as an input, yielding a linearly
projected output $x_{\text{test}}^T w^*$. Notice that the output is a real value, while the label
$y_{\text{test}}$ is binary, i.e., $y_{\text{test}} \in \{+1, -1\}$. Hence, we take a transformation of
$x_{\text{test}}^T w^*$ in an effort to make an apples-to-apples comparison with such a binary
label. To this end, if $x_{\text{test}}^T w^* > 0$, we declare a legitimate email, outputting
$\hat{y}_{\text{test}} = +1$. Otherwise, we declare a spam email, setting $\hat{y}_{\text{test}} = -1$.
In other words, we take the sign of the output:

$$\hat{y}_{\text{test}} = \text{sign}(x_{\text{test}}^T w^*). \tag{1.131}$$


Comparing this to the ground-truth label $y_{\text{test}}$, we compute a loss function, the
so-called 0/1 loss. It takes 0 if they coincide and 1 otherwise. Considering many
such data points, say $m_{\text{test}}$ examples, we compute the test error as the average of
the 0/1 losses:

$$\text{TestError} = \frac{1}{m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}} \mathbf{1}\{\hat{y}_{\text{test}}^{(i)} \neq y_{\text{test}}^{(i)}\}, \tag{1.132}$$

where $\mathbf{1}\{\cdot\}$ denotes an indicator function which outputs 1 when the event
$\{\cdot\}$ is true and 0 otherwise.
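A minimal sketch of (1.131) and (1.132) in NumPy follows. The weights and the four test points are made up for illustration, and the helper names `predict` and `zero_one_error` are assumptions of this sketch. Note that, following the rule stated above, an output of exactly 0 is mapped to $-1$.

```python
import numpy as np

def predict(X, w):
    """Return +1 where x^T w > 0 and -1 otherwise, as in (1.131)."""
    return np.where(X @ w > 0, 1, -1)

def zero_one_error(y_hat, y):
    """Average 0/1 loss: the fraction of predictions that differ from the labels."""
    return float(np.mean(y_hat != y))

# Hypothetical trained weights and test set, for illustration only.
w_star = np.array([1.0, -1.0])
X_test = np.array([[2.0, 1.0],    # x^T w =  1.0 -> +1
                   [0.5, 2.0],    # x^T w = -1.5 -> -1
                   [3.0, 0.0],    # x^T w =  3.0 -> +1
                   [1.0, 1.0]])   # x^T w =  0.0 -> -1 (not strictly positive)
y_test = np.array([1, -1, -1, -1])

y_hat = predict(X_test, w_star)
print("predictions:", y_hat)
print("test error:", zero_one_error(y_hat, y_test))
```

Here exactly one of the four predictions (the third) disagrees with its label, so the test error is 1/4.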


An example of test error computation Here is an example that demonstrates
the test error performance of a trained classifier evaluated on a test dataset. Suppose
that the test dataset has: (1) 1139 legitimate emails; and (2) 127 spam emails.
Applying the trained classifier, we obtain 1120 correct and 19 wrong answers for the
1139 legitimate emails. For the 127 spam emails, we have 95 correct and 32 wrong
answers. See Fig. 1.21 for the summary of the results.

| | $\hat{y} = +1$ | $\hat{y} = -1$ |
|---|---|---|
| $y = +1$ (legitimate) | 1120 (true positive) | 19 (false negative, misdetection) |
| $y = -1$ (spam) | 32 (false positive, false alarm) | 95 (true negative) |

Figure 1.21. Test error performance of a trained classifier conducted on a test dataset.

Then, the test error is computed as:

$$\text{TestError} = \frac{19 + 32}{1139 + 127} \approx 4\%. \tag{1.133}$$


Four types of events of interest and two types of errors The test error is
categorized into two types depending on what the ground truth is, and a different
emphasis should be put on the two types depending on the application. To explain
what this means, let us first introduce four types of events that help in understanding
the different types of errors. The first is the true positive case, which indicates the
event $\{\hat{y} = +1 \mid y = +1\}$. The second refers to another desirable situation: the true
negative case, which indicates $\{\hat{y} = -1 \mid y = -1\}$.

Now the third and fourth are relevant to error events. The third is the false negative
(or misdetection) case, indicating $\{\hat{y} = -1 \mid y = +1\}$. The fourth is the false
positive (or false alarm) case, which refers to $\{\hat{y} = +1 \mid y = -1\}$. The first type of
error concerns the false negative case and is therefore called the false negative
rate (FNR):

$$\text{FNR} := P\{\hat{y} = -1 \mid y = +1\}. \tag{1.134}$$

The second type of error refers to the false positive case, so it is called the false
positive rate (FPR):

$$\text{FPR} := P\{\hat{y} = +1 \mid y = -1\}. \tag{1.135}$$

On a side note: The above naming (e.g., FNR and FPR) may be a bit confusing
to some readers. One way to remember the naming is as follows. If the test result
(i.e., the prediction $\hat{y}$) is +1 (or −1), we say positive (or negative). If the prediction
is correct (or wrong), then it reads true (or false).
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In the above example exhibited in Fig. 1.21, the two types of error can be com- 


puted as: 
$$\text{FNR} = \frac{19}{19 + 1120} \approx 1.7\%; \qquad \text{FPR} = \frac{32}{32 + 95} \approx 25\%.$$
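The arithmetic behind the test error and the two error rates can be checked in a few lines of plain Python. The four counts are taken from Fig. 1.21; the variable names are choices made for this sketch.

```python
# Counts from Fig. 1.21 (first pair: ground truth legitimate; second pair: ground truth spam).
tp, fn = 1120, 19   # legitimate emails: declared legitimate / wrongly declared spam
fp, tn = 32, 95     # spam emails: wrongly declared legitimate / declared spam

test_error = (fn + fp) / (tp + fn + fp + tn)   # all wrong answers over all emails
fnr = fn / (fn + tp)                           # errors among legitimate emails
fpr = fp / (fp + tn)                           # errors among spam emails

print(f"test error = {test_error:.1%}")   # 4.0%
print(f"FNR        = {fnr:.1%}")          # 1.7%
print(f"FPR        = {fpr:.1%}")          # 25.2%
```

Note that the overall test error is dominated by the much larger legitimate class, which is why it sits far closer to the FNR than to the FPR.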


Notice that the two errors are highly imbalanced; one is much smaller than the
other. If you think about it, this is a rather desirable situation. In reality, it is crucial
to protect against missing legitimate emails. In other words, we should declare emails
as legitimate whenever they are indeed legitimate, meaning that we should
reduce the FNR as much as possible. In this case, it is around 1.7%, which is more or
less okay.

On the other hand, we may be okay with declaring some actual spam emails as
legitimate ones, meaning that a moderate value of the FPR may be acceptable in reality.
In this case, it is around 25%, which is more or less okay.


Margin-based linear classifier vs. least-squares classifier Now that we have
the concept of the test error, which is a proper performance measure, we are
ready to compare the performances of the two classifiers: the margin-based linear
classifier and the least-squares classifier.

To this end, we first gather a training dataset: $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$. We then use this
to design the margin-based linear classifier (using LP) as well as the least-squares
classifier. Next we test the classifiers on a test dataset
$\{(x_{\text{test}}^{(i)}, y_{\text{test}}^{(i)})\}_{i=1}^{m_{\text{test}}}$ to compute $\text{TestError}_{\text{linear}}$ and
$\text{TestError}_{\text{leastsquares}}$. This is how we compare performances. You
may wonder which is better in terms of the test error measure. It turns out the
answer depends on the dataset. In other words, there is often no definite answer like:
one is always better than the other. In Prob 3.1, you will have a chance to check just
one case.


Regularization technique We face an issue in applying the least-squares classifier
if no modification is made. In reality, a data point, say $x^{(i)}$, contains
some noise. Data points are usually obtained from measurements made by humans
or sensors. But humans and sensors are not perfect, so $x^{(i)}$ inevitably contains
some error. This error causes an issue: large values of $\|w^*\|$ can amplify such
noise.

To avoid this, we want to make those values small. One way to implement this is
to minimize $\|w\|^2$. But obviously at the same time, we want to make
$\|Aw - b\|^2$ small; otherwise, $w^*$ would always be 0, which is obviously not what
we want. This motivates a natural idea, which is to
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regulate the two objectives at the same time: 


$$\min_{w \in \mathbb{R}^d} \|Aw - b\|^2 + \lambda \|w\|^2 \tag{1.136}$$

where $\lambda \geq 0$. Notice that for one extreme case of $\lambda = 0$, we obtain the conventional
least-squares solution, while for the other extreme case of $\lambda = \infty$, we get $w^* = 0$,
in which case the classifier output carries no information about the input and we can only
declare spam emails randomly. If prior knowledge is available, one
can take a smarter action. For instance, if the probability of a randomly selected
email being spam (called the a priori probability) is known, then one can declare any
randomly selected email as spam with the known probability.

The above technique is called regularization and $\lambda$ is called the regularization
factor. In the machine learning community, the regularization factor is considered
a hyperparameter, i.e., a parameter that we are free to choose as part of the design.


How to choose $\lambda$? One natural question that arises is then: how to choose such
a hyperparameter $\lambda$. To figure this out, we need to understand how the performance
varies with the regularization factor $\lambda$. First consider the training error, which
is defined as:

$$\text{TrainError} = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\{\hat{y}^{(i)} \neq y^{(i)}\} \tag{1.137}$$

where $\hat{y}^{(i)} = \text{sign}(x^{(i)T} w^*)$. The training error is typically smallest at
$\lambda = 0$ because that case focuses only on the error induced by the training dataset,
and it tends to increase as $\lambda$ grows. Notice that the larger $\lambda$ is, the more
we regulate, at the expense of the training error. So we will get something like the
blue curve plotted in Fig. 1.22.


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Figure 1.22. Training and test errors as a function of regularization factor 2. 
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On the other hand, the situation is different for the test error: 
$$\text{TestError} = \frac{1}{m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}} \mathbf{1}\{\hat{y}_{\text{test}}^{(i)} \neq y_{\text{test}}^{(i)}\}. \tag{1.138}$$

When $\lambda = 0$, the test error would be small but larger than the training error because
the $\lambda = 0$ case focuses only on the training error. Increasing $\lambda$, we would have a
regularization effect, so we can make the performance less dependent on the train
data. This then puts an indirect emphasis on the test dataset, typically yielding a smaller
test error. But if $\lambda$ is too big, the classifier would be close to $w^* = 0$, in which case the
test error would be very large. So one can expect there is a sweet spot of
$\lambda$ that minimizes the test error. Hence, we may obtain something like the red curve
plotted in Fig. 1.22. This is indeed what is commonly observed. This suggests a natural idea:
choosing the $\lambda^*$ that minimizes the test error.


Validation data But you may recognize an issue here. Remember that test data is
defined as unseen data which is never employed during training. However, finding
such a $\lambda^*$ is included in the process of designing a model; hence, any data employed
for finding $\lambda^*$ is actually “seen data”. So we need another dataset for searching
for such a hyperparameter. The data used for that purpose is validation data. It is still
“seen data”. But it is employed for the purpose of validating the goodness of a
trained model. In other words, it is for the hyperparameter search. One important
note on validation data: While constructing datasets, it is important to ensure that
the distribution of the validation dataset is as similar as possible to that of the test
dataset. This is because hyperparameters (tuned using validation data) should
be set so as to yield a good performance w.r.t. unseen test data.
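The hyperparameter search described above can be sketched as a simple sweep: train one classifier per candidate $\lambda$ on the train data and keep the $\lambda$ with the smallest validation error. Everything concrete below is an assumption made for illustration: the synthetic data generator, the 10% label-flip rate, the split sizes, and the helper names. The regularized solution is written in the form $(A^T A + \lambda I)^{-1} A^T b$, which is the stationarity condition of (1.136).

```python
import numpy as np

def ridge_weights(A, b, lam):
    """Minimizer of ||A w - b||^2 + lam * ||w||^2 via (A^T A + lam I) w = A^T b."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

def zero_one_error(A, b, w):
    """Fraction of examples whose sign prediction differs from the label."""
    return float(np.mean(np.where(A @ w > 0, 1, -1) != b))

# Hypothetical synthetic data: labels from a hidden linear rule, 10% of them flipped.
rng = np.random.default_rng(0)
d = 20
w_true = rng.normal(size=d)

def make_split(m):
    X = rng.normal(size=(m, d))
    y = np.where(X @ w_true > 0, 1, -1)
    flip = rng.random(m) < 0.1
    return X, np.where(flip, -y, y)

A_train, b_train = make_split(40)    # used to fit w for each lambda
A_val, b_val = make_split(200)       # used only to pick lambda

lambdas = np.logspace(-2, 3, 51)
val_errors = [zero_one_error(A_val, b_val, ridge_weights(A_train, b_train, lam))
              for lam in lambdas]
best = int(np.argmin(val_errors))
print("lambda* =", lambdas[best], " validation error =", val_errors[best])
```

A separate test set, untouched during this sweep, would then be used once to report the test error of the classifier trained with the selected $\lambda^*$.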


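How to solve the regularized problem? Going back to the regularized least-squares
problem (1.136), how can we solve the problem? If you think about it, this is
nothing but another least-squares problem. Why? Applying the same simplification
trick as Gauss did earlier, we obtain:

$$\|Aw - b\|^2 + \lambda \|w\|^2 = \left\| \begin{bmatrix} A \\ \sqrt{\lambda} I \end{bmatrix} w - \begin{bmatrix} b \\ 0 \end{bmatrix} \right\|^2 = \|A' w - b'\|^2$$

where

$$A' := \begin{bmatrix} A \\ \sqrt{\lambda} I \end{bmatrix} \in \mathbb{R}^{(m+d) \times d} \quad \text{and} \quad b' := \begin{bmatrix} b \\ 0 \end{bmatrix} \in \mathbb{R}^{m+d}. \tag{1.139}$$

Hence, we get:

$$\min_{w \in \mathbb{R}^d} \|A' w - b'\|^2. \tag{1.140}$$

So the solution would be:

$$w^* = (A'^T A')^{-1} A'^T b'. \tag{1.141}$$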
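The stacking trick can be checked numerically. The sketch below builds the augmented pair by placing $\sqrt{\lambda} I$ under $A$ and zeros under $b$, solves the ordinary LS problem on it, and compares the result with the equivalent expression $(A^T A + \lambda I)^{-1} A^T b$ obtained by expanding the products of the stacked matrices. The toy data, the value $\lambda = 0.5$, and the variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical small problem, for illustration only.
A = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [-1.0, -1.5],
              [-2.0, -0.5]])
b = np.array([1.0, 1.0, -1.0, -1.0])
lam = 0.5
m, d = A.shape

# Stack sqrt(lam) * I under A and d zeros under b.
A_aug = np.vstack([A, np.sqrt(lam) * np.eye(d)])
b_aug = np.concatenate([b, np.zeros(d)])

# Ordinary least squares on the augmented system.
w_aug = np.linalg.solve(A_aug.T @ A_aug, A_aug.T @ b_aug)

# Same solution written without the augmentation.
w_direct = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

print("augmented LS solution:", w_aug)
print("solutions agree:", np.allclose(w_aug, w_direct))
```

The augmented form is convenient because any ordinary LS solver can be reused unchanged once the stacked matrices are constructed.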


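CVXPY implementation We have figured out that many types of LS problems
with (or without) regularization and/or bias terms can be cast into:

$$\min_{w} \|Aw - b\|^2. \tag{1.142}$$

So constructing $(A, b)$ from data and formulating an objective function accordingly
is the key to CVXPY implementation. For illustrative purposes, let us dig into the
implementation details under a simple example in which there are no regularization and
bias terms, and a dataset is given by:

$$\{(x^{(i)}, y^{(i)})\}_{i=1}^{4} = \left\{ \left( \begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix}, 5 \right), \left( \begin{bmatrix} -1 \\ 2 \\ 3 \end{bmatrix}, -7 \right), \left( \begin{bmatrix} 3 \\ 1 \\ 4 \end{bmatrix}, -9 \right), \left( \begin{bmatrix} 7 \\ 4 \\ 2 \end{bmatrix}, -12 \right) \right\}.$$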


Here is the code for the implementation:


import cvxpy as cp
import numpy as np

# optimization variable
w = cp.Variable(3)

# construct (A, b): each row of A is one data point x^(i)T
A = np.array([[2, 3, 1],
              [-1, 2, 3],
              [3, 1, 4],
              [7, 4, 2]])
b = np.array([5, -7, -9, -12])

# objective function
cost = cp.sum_squares(A @ w - b)
obj_min = cp.Minimize(cost)

# set up a problem
prob = cp.Problem(obj_min)

# solve the problem
prob.solve()

# print the solution
print('status: ', prob.status)
print('optimal value: ', prob.value)
print('optimal w*: ', w.value)
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status: optimal 
optimal value: 71.90174092652698 
optimal w*: [-1.23989377 1.23517262 -2.36470935] 


A good thing about CVXPY is that the squared Euclidean norm
(the key operation that arises in the LS problem) is already implemented as
cp.sum_squares(A @ w - b). So the only thing that you should do is to construct
$(A, b)$ with numpy.array and then plug them into the objective function properly. One
can also readily check that the above $w^*$ coincides with the closed-form solution
$(A^T A)^{-1} A^T b$.
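That check can be done with NumPy alone. The sketch below rebuilds the same $(A, b)$ from the four data points of the example and evaluates the closed form by solving the normal equations; the variable names `w_closed` and `value_closed` are choices made here. The results should match the solver output reported above, up to solver tolerance.

```python
import numpy as np

# Same (A, b) as in the CVXPY example: one data point per row of A.
A = np.array([[2, 3, 1],
              [-1, 2, 3],
              [3, 1, 4],
              [7, 4, 2]], dtype=float)
b = np.array([5, -7, -9, -12], dtype=float)

# Closed form w* = (A^T A)^{-1} A^T b, computed by solving (A^T A) w = A^T b.
w_closed = np.linalg.solve(A.T @ A, A.T @ b)
value_closed = float(np.sum((A @ w_closed - b) ** 2))

print("closed-form w*:", w_closed)
print("objective value:", value_closed)
```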


Look ahead In the next section, we will study another application in which the
least-squares problem plays a crucial role. The application arises
in a completely different field, the medical field: Computed Tomography, which
you often hear of simply as CT.
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1.10 LS: Computed Tomography 


Recap During the past two sections, we have studied several topics on the
second instance of convex optimization, Least Squares (LS):

$$\min_{x} \|Ax - b\|^2 \tag{1.143}$$

where $x \in \mathbb{R}^d$, $A \in \mathbb{R}^{m \times d}$, $b \in \mathbb{R}^m$ and $m > d$. In particular, we studied
those topics in the context of one very important machine learning application: the
classification problem. Specifically we developed another linear classifier which can be
designed via an LS problem. We also discussed a popular performance measure, called the
test error, which is instrumental in making a fair comparison between distinct
classifiers.


Outline In this section, we are going to study another application that arises 
in quite a different domain: the medical imaging field. The application that we 
will investigate is: Computed Tomography, CT for short. Very interestingly, there 
is a mathematical principle behind the idea of CT and the principle is based 
on LS. To see how the LS problem is related to CT, we need to understand 
the principle of CT. But to this end, we first need to understand the principle 
of X-ray which forms the basis of CT. So we are going to proceed in the fol- 
lowing sequence. We will first study what X-ray is together with its key prin- 
ciple. We will then figure out the motivating context behind the invention of 
CT. Next we will explore the idea of CT. Finally we will formulate CT as an LS 
problem. 


X-ray X-ray is a form of electromagnetic radiation. It was discovered in 1895 by
a German physicist, Wilhelm Röntgen (Rosenbusch and de Knecht-van
Eekelen, 2019). The discovery opened up a new medical field, called radiology.
Radiology is now a very well-known and well-established field in medical imaging,
but there was no such field before the discovery of X-ray. In addition, X-ray
played a significant role in other areas beyond the medical field, like physics and
chemistry. So the discovery won Röntgen the first Nobel Prize in Physics in 1901.
Take a look at the very short 6-year gap between the discovery year 1895 and the
award year 1901. It is a very rare case because it usually takes much longer
(often more than 10 to 20 years) to receive a Nobel Prize after a discovery or
invention.


Principle of X-ray The principle of X-ray that we will focus on was discovered
through Röntgen’s key observation made while he was working in his laboratory.
While Röntgen was dealing with experimental tubes, he observed a type of unidentified
radiation emanating from the tubes. The name “X-ray” originated from
the nature of the unidentified radiation, since “X” typically refers to an unknown.
And he made an interesting observation about the mysterious radiation. When passing
through an object, it absorbs photons and its energy (typically quantified as the
intensity) is proportional to the number of photons absorbed: the more photons
(the denser the object), the stronger the intensity.

In fact, he did lots of experiments which support the observation, and he even tested
it on his wife’s hand, obtaining a scary picture that shows the bone structure of
the hand.⁵ It was arguably the first medical X-ray image in history. The discovery
together with such medical images made many people excited about the X-ray. In
particular, people used the X-ray for the purpose of investigating the inside structure
of objects of interest (like human bodies) without cutting them open, and this opened
up a new medical field: radiology, which later opened up a broader field of medical
imaging.


Limitations of X-ray However, the X-ray has some limitations in figuring out
the internal structure of objects of interest. Usually an object of interest is
3-dimensional. But the X-ray can generate only a projected 2-dimensional image, so
this poses a challenge in identifying a complicated 3D object structure. For example,
it is hard to spot tumors behind bones. Another clear example is illustrated in
Fig. 1.23. Here a 2D image projected on the wall looks like a human hand, but


Figure 1.23. An example in which the projected 2D image does not well represent the 3D 
structure. 


5. At that time, Röntgen had no idea of how detrimental X-ray can be to human bodies. His wife passed away
in 1919, 24 years after the discovery, so whether the exposure affected her health is unclear.
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the actual 3D object is a rabbit. This implies that much of the key structure-related 
information can be lost while being projected. This clearly shows the limitations 
of X-ray. 


Invention of Computed Tomography (CT) Many people tried to solve this
information-missing problem. Among such efforts, history was made in 1967.
At that time, a smart way to address the problem was developed, named Computed
Tomography (CT). The technique together with a computer-aided machine
was invented by two people: one is a British electrical engineer, Godfrey
Hounsfield; and the other is a South African-American physicist, Allan Cormack
(Cormack and Hounsfield). In fact, this invention was not done in a cooperative
manner; rather, it was done independently. While Hounsfield’s invention
was slightly earlier, the credit was also fully given to Cormack, so they were
co-awarded the Nobel Prize in Physiology or Medicine in 1979.


Idea of CT The idea of CT is very well reflected in the name. “Tomos” is a Greek 
word which means “projected section (or slice)”; and “graphy” means “describe”. 
So in words, it means “describing an object using slices.” Details on the idea are the 
following: 


1. Project X-ray beams onto an object from many different angles; 
2. Calculate the intensities of the projected images (slices); 
3. Use them to reconstruct (describe) the object. 


A simple example To explain what it means in detail, let us consider the following
example in which an object of interest is comprised of four equal-sized grids;
see Fig. 1.24. For illustrative purposes, we consider a 2D object, although many
objects of interest are 3D. Once you grasp the idea, you will soon understand that




Figure 1.24. A simple grid example that well illustrates the idea of CT. 
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the idea can readily be extended to a 3D object case. The object has two black grids 
at the upper left and lower right, and two white grids at the upper right and
lower left. Suppose that X-ray absorbs no photon when passing through a black
grid, i.e., the density is 0. Here we define the unit of the density as the number of
photons per unit length. On the other hand, assume that X-ray absorbs something
(say, the density is 1) when passing through a white grid.

Suppose we project a horizontal X-ray beam to the upper part of the object so 
that it passes through the two upper grids. Then, it would absorb nothing in the 
black grid zone while absorbing something in the white grid zone. Since the unit 
of the density that we define here is w.r.t. the unit length, the intensity of the X-ray 
beam would be proportional to the width of the white grid zone. For simplicity, 
assume that the width is 1 unit length. Then, the intensity would be 1. But there is 
always an error in measurement, say that the measurement noise here is 0.1 (marked 
in red in Fig. 1.24). We also project another horizontal X-ray beam to the bottom 
part of the object. We then measure the corresponding intensity, say 1 — 0.05. 
Projecting two vertical downward beams, we get two measurements, say: 1 + 0.02 
and 1 — 0.1. 

Shooting a top-left to bottom-right diagonal beam at the object, it would
absorb nothing since it passes only through the black grids, so the measurement
would be close to 0, say $0 - 0.03$. On the other hand, the other bottom-left to
top-right diagonal beam would pass only through the white grids. Since the length
of the passing zone is $2\sqrt{2}$, the intensity measurement would be close to $2\sqrt{2}$, say
$2\sqrt{2} + 0.2$.

What we want to figure out are the densities of the four grids, so let us denote
those by unknown variables, say $d_1$ (top left), $d_2$ (top right), $d_3$ (bottom left), $d_4$
(bottom right). Using these notations, we can express the above six measurements
as the following linear equations:

$$\begin{aligned} d_1 + d_2 &= 1.1 \\ d_3 + d_4 &= 0.95 \\ d_1 + d_3 &= 1.02 \\ d_2 + d_4 &= 0.9 \\ \sqrt{2} d_1 + \sqrt{2} d_4 &= -0.03 \\ \sqrt{2} d_2 + \sqrt{2} d_3 &= 2\sqrt{2} + 0.2. \end{aligned} \tag{1.144}$$


Least squares formulation Notice in (1.144) that we have six equations while 
having four unknowns. So there is no solution in general. This is indeed the 
no-solution case. But we can invoke Gauss’s idea to address this case. In other words, 
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we can formulate it as an LS problem. Defining 


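$$A := \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ \sqrt{2} & 0 & 0 & \sqrt{2} \\ 0 & \sqrt{2} & \sqrt{2} & 0 \end{bmatrix} \quad \text{and} \quad b := \begin{bmatrix} 1.1 \\ 0.95 \\ 1.02 \\ 0.9 \\ -0.03 \\ 2\sqrt{2} + 0.2 \end{bmatrix},$$

we can formulate it as:

$$\min_{d} \|Ad - b\|^2. \tag{1.145}$$

Then, the solution would be $d^* = (A^T A)^{-1} A^T b$. For the above example, we
obtain:

$$(d_1^*, d_2^*, d_3^*, d_4^*) = (0.0400, 1.0613, 1.0463, -0.0950).$$

The solution makes intuitive sense, as it is close to the ground truth $(0, 1, 1, 0)$.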
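The reported densities can be reproduced with a few lines of NumPy. The sketch below encodes the six beam measurements as the rows of $A$ and the entries of $b$ and solves the normal equations; the variable names `r2` and `d_star` are choices made for this illustration.

```python
import numpy as np

r2 = np.sqrt(2.0)

# Rows: two horizontal, two vertical, and two diagonal beams; columns: d1, d2, d3, d4.
A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [r2, 0, 0, r2],
              [0, r2, r2, 0]])
b = np.array([1.1, 0.95, 1.02, 0.9, -0.03, 2 * r2 + 0.2])

# Least-squares estimate d* = (A^T A)^{-1} A^T b.
d_star = np.linalg.solve(A.T @ A, A.T @ b)
print(np.round(d_star, 4))
```

Rounded to four decimals, the estimate agrees with the values stated above. The small negative value of $d_4^*$ is an artifact of the measurement noise, since the LS problem does not enforce nonnegative densities.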


A more realistic example In fact, the example in Fig. 1.24 is too simple. In
reality, an object is of an arbitrary shape and the density of the object changes
continuously over regions. To understand how to apply the idea to a more
realistic object, let us consider another example in Fig. 1.25.

Here what we want to figure out is the density, say $d(x, y)$, which indicates the
density at a coordinate $(x, y)$. Suppose we project an X-ray beam onto the object (in
a bottom-left to top-right diagonal direction), as illustrated in Fig. 1.25. Let $t$ be
the length of the beam trajectory from the starting point $(x_0, y_0)$ at $t = 0$. Let $\theta$ be
the angle of the beam in reference to the x-axis. Then, the coordinate $p(t)$ that the



Figure 1.25. A more realistic example for illustration of CT idea. 
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beam indicates when the beam length is ¢ would be: 
$$p(t) = (x_0, y_0) + t(\cos\theta, \sin\theta). \tag{1.146}$$


Remember that the intensity is: (density) × (length that the beam traverses).
Here the density changes over the regions that the beam sweeps. So it can be
represented as $d(p(t))$. And the length of the beam w.r.t. the very small region where the
density is almost constant can be represented by $dt$. So the intensity measurement
would be:

$$b = \int d(p(t))\, dt + v \tag{1.147}$$

where $v$ indicates a measurement noise.
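To make the line integral in (1.147) concrete, here is a noise-free numerical sketch. The density map (1 inside the unit disk, 0 outside), the helper names, and the step count are all assumptions for illustration. The beam is parameterized exactly as in (1.146), and the integral is approximated by a midpoint Riemann sum over small steps of length $dt$.

```python
import numpy as np

def density(x, y):
    """Hypothetical density map: 1 inside the unit disk centered at the origin, 0 outside."""
    return np.where(x**2 + y**2 < 1.0, 1.0, 0.0)

def beam_intensity(x0, y0, theta, length, n_steps=4000):
    """Midpoint-rule approximation of the integral of d(p(t)) dt over 0 <= t <= length."""
    dt = length / n_steps
    t = (np.arange(n_steps) + 0.5) * dt          # midpoints of the small segments
    px = x0 + t * np.cos(theta)                  # p(t) = (x0, y0) + t (cos, sin)
    py = y0 + t * np.sin(theta)
    return float(np.sum(density(px, py)) * dt)

theta = np.pi / 4
# Start 2 units before the disk center along the beam direction.
x0, y0 = -2.0 * np.cos(theta), -2.0 * np.sin(theta)
print(beam_intensity(x0, y0, theta, length=4.0))   # close to the chord length 2

# A horizontal beam at height y = 1.5 misses the disk entirely.
print(beam_intensity(-2.0, 1.5, 0.0, length=4.0))
```

Since the density is constant along the chord, the intensity reduces to (density) × (chord length), in line with the rule recalled above.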


Discretization A question arises. How can we estimate the density $d(p(t))$ only from
such measurements (1.147)? Specifically, one can ask: How is it related to the LS
problem, which does not deal with such an integral term? The idea is to apply
the discretization illustrated in Fig. 1.26. We make many minuscule grids in the
space so that the density for each tiny grid can be assumed to be constant. Let $d_i$ be
the density of the $i$th grid. Denoting by $a_i$ the length of the beam traversed at the
$i$th grid, we can approximate the intensity absorbed through the $i$th grid as $a_i d_i$.
Letting $S_{\text{beam}}$ be the set of the indices of the grids that the beam travels through, we
can approximate the aggregated intensity measured as:

$$b \approx \sum_{i \in S_{\text{beam}}} a_i d_i + v. \tag{1.148}$$


Figure 1.26. The discretization idea for CT. 
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Now shooting many X-ray beams from many different angles, we obtain the 
following measurements: 


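$$\begin{aligned} b_1 &\approx \sum_{i \in S_{\text{beam}_1}} a_i d_i + v_1 \\ b_2 &\approx \sum_{i \in S_{\text{beam}_2}} a_i d_i + v_2 \\ &\;\;\vdots \\ b_m &\approx \sum_{i \in S_{\text{beam}_m}} a_i d_i + v_m. \end{aligned} \tag{1.149}$$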


LS formulation Notice in (1.149) that we have m equations and the number 
of unknowns is the same as the number of grids that the object spans. With a 
sufficiently large number m of measurements (this is subject to our design), we can 
make m always larger than the number of unknowns. And for this setting, we can 
again apply Gauss’s idea to formulate an LS problem as follows: 


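$$\min_{d} \|Ad - b\|^2 \tag{1.150}$$

where

$$A := \begin{bmatrix} \{a_i\}_{i \in S_{\text{beam}_1}} \\ \{a_i\}_{i \in S_{\text{beam}_2}} \\ \vdots \\ \{a_i\}_{i \in S_{\text{beam}_m}} \end{bmatrix}, \qquad b := \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}.$$

Here $\{a_i\}_{i \in S_{\text{beam}_1}}$ is an $n_{\text{grid}}$-dimensional row vector wherein we read $a_i$ for
$i \in S_{\text{beam}_1}$ and 0 otherwise, and $n_{\text{grid}}$ denotes the total number of grids. For
instance, if $n_{\text{grid}} = 10$ and $S_{\text{beam}_1} = \{1, 5, 9\}$, then

$$\{a_i\}_{i \in S_{\text{beam}_1}} = [a_1 \;\; 0 \;\; 0 \;\; 0 \;\; a_5 \;\; 0 \;\; 0 \;\; 0 \;\; a_9 \;\; 0]. \tag{1.151}$$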
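Constructing one such row of $A$ is a small bookkeeping task. The sketch below uses a hypothetical helper `beam_row` that takes the number of grids and a dictionary mapping each 1-based grid index in $S_{\text{beam}}$ to its traversed length $a_i$; the three numerical lengths are made-up values for illustration.

```python
import numpy as np

def beam_row(n_grid, lengths):
    """Build one row of A from a dict {grid index (1-based): traversed length a_i}."""
    row = np.zeros(n_grid)
    for i, a_i in lengths.items():
        row[i - 1] = a_i          # the text indexes grids from 1, NumPy from 0
    return row

# Hypothetical traversed lengths for S_beam1 = {1, 5, 9} with n_grid = 10.
row1 = beam_row(10, {1: 0.7, 5: 1.0, 9: 0.4})
print(row1)

# Stacking one such row per beam yields the measurement matrix A.
A = np.vstack([row1, beam_row(10, {2: 1.0, 6: 1.0})])
print(A.shape)
```

In a realistic scan most entries of each row are zero, since a single beam crosses only a small fraction of the grids.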


History of CT scanners This is the idea that Hounsfield came up with. While
he borrowed Gauss’s idea, we believe that the way he adapted it is highly non-trivial.
Applying this idea, Hounsfield also developed a prototype CT scanner in
1971 (Beckmann, 2006); see the leftmost picture in Fig. 1.27.

Remember that he was an electrical engineer — he was good enough to build an 
electrical computer-aided machine. The prototype supported m = 160 measure- 
ments (X-ray beams). The scanning time for each beam was a little over 5 minutes. 
So the total scanning time was around 13 hours. Also, the computation time for 
reconstructing an object with measurements (solving the LS problem) was around 
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Prototype CT scanner A historical EMl-scanner CT scanner nowadays 


Figure 1.27. History of CT scanners. 


2.5 hours on a computer that he had. So it could not be commercialized as it took 
lots of time. 

Fortunately, at that time, Hounsfield was working at a big and supportive company:
EMI (Electric and Musical Industries). While EMI was a music-record company,
it was rich enough to invest some money in a field which has nothing to do with
the music industry. Actually EMI went even further. There was a rumour
that with tons of money from the sales of The Beatles records in the 1960s, EMI
helped fund the development of CT scanners. In any case, the fact is that in the same
year 1971, with generous support from EMI, Hounsfield developed the first commercial
CT scanner, named the EMI-scanner (Bhattacharyya, 2016); see the
middle picture in Fig. 1.27. The performance of the scanner was remarkable relative
to the prototype scanner. The scanning and reconstruction times were around
4 minutes and 7 minutes, respectively. So it could be commercialized.

CT scanners nowadays are beyond remarkable. For example, Siemens CT scan- 
ner (2017 model) in Fig. 1.27 takes only ~ 0.33 seconds for the whole process. 


Look ahead So far, we have studied two instances of convex optimization: LP 
and LS. In the next section, we will study another instance which subsumes LP and 
LS as special cases: Quadratic Program. 
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Problem Set 3 


Prob 3.1 (Margin-based linear vs. Least-Squares classifiers) Consider
the legitimate-vs-spam email classification. We are given a training dataset
$\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$. This is the same as that in the file “train.csv” that we
employed in Prob 2.3. Remember the file is placed in:


http://csuh.kaist.ac.kr/convex_book/PS2/train.csv


In this problem, you will build two classifiers: (i) the margin-based linear classifier;
and (ii) the least-squares classifier. You will also compare the test error performances
with a test dataset $\{(x_{\text{test}}^{(i)}, y_{\text{test}}^{(i)})\}_{i=1}^{m_{\text{test}}}$. The test dataset is another
file “test.csv” uploaded on:


http://csuh.kaist.ac.kr/convex_book/PS3/test.csv


You need to write a script for CVXPY implementation.


(a) Formulate an LP for the margin-based linear classifier. Solve the problem
(using the training data) to design the classifier.

(b) Formulate an optimization for the least-squares classifier. Solve the problem
to design the classifier.

(c) Use Python to compute the test errors of the two classifiers designed in parts
(a) and (b). Which classifier is better in terms of test error?

(d) State the two types of errors together with their definitions. Also explain
which type should be further minimized relative to the other in this problem.


Prob 3.2 (Bias-allowing least-squares problem) In this problem, you will
design least-squares classifiers which allow for bias terms, for legitimate-vs-spam
email classification. Suppose that a linear classifier reads:

$$y = x^T w + c \tag{1.152}$$

where $x \in \mathbb{R}^d$ and $y \in \mathbb{R}$ denote the input and output of the classifier, respectively.
Also $w \in \mathbb{R}^d$ and $c \in \mathbb{R}$ indicate the linear weight and bias, respectively. Let

$$A := \begin{bmatrix} x^{(1)T} & 1 \\ \vdots & \vdots \\ x^{(m)T} & 1 \end{bmatrix} \in \mathbb{R}^{m \times (d+1)}, \qquad b := \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^{m},$$

where $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ indicates the training dataset employed in Prob 3.1. You need
to write a script for CVXPY implementation.


(a) Formulate the bias-allowing least squares problem. Use the training dataset 
to design a classifier. 

(b) Use the trained classifier together with the test dataset in Prob 3.1 to com- 
pute the test error. Also compare its performance to that of the ordinary 
least squares classifier designed in Prob 3.1. 


Prob 3.3 (Regularized least-squares problem) In this problem, you will 
design regularized least-squares classifiers for legitimate-vs-spam email classifica- 
tion. Use the same datasets employed in Prob 3.1. You also need to write a script 
for CVXPY implementation. 


(a) Formulate a regularized least squares problem. Use the training dataset to
design classifiers for different regularization factors $\lambda$. You may want to set
up a range of $\lambda$ like:


import numpy as np 
lambda_ = np.logspace(-2,3,51) 


(b) Define the training error. Use Python to plot the training error as a function
of $\lambda$. Also explain the shape of the curve and describe why.

(c) Define the test error. Use the classifiers (trained in part (a)) together with
the test dataset to plot the test error as a function of $\lambda$. Also explain the
shape of the curve and describe why.


Prob 3.4 (2nd-order condition of convexity) Suppose $f: \mathbb{R}^d \to \mathbb{R}$ is twice
differentiable, i.e., its second derivative $\nabla^2 f$ (also called the Hessian) exists at each
point in dom f. In addition to the 1st-order condition of convexity that we proved
in Prob 1.5, another well-known fact w.r.t. convexity is: f is convex if and only if

$$\text{dom} f \text{ is convex;} \qquad \nabla^2 f(x) \text{ is positive semidefinite, i.e., } \nabla^2 f(x) \succeq 0 \ \ \forall x \in \text{dom} f. \tag{1.153}$$

This problem explores the proof of this via the following subproblems.


(a) State the definition of a positive semidefinite matrix. 

(b) Suppose d = 1. Show that if f(x) is convex, then (1.153) holds. 
(c) Suppose d = 1. Show that if (1.153) holds, then f(x) is convex. 
(d) Prove the 2nd-order condition for arbitrary d. 


Prob 3.5 (Concepts) 


(a) In binary classification, four types of events are emphasized in Section 1.9. 
What are those? Also explain the rationale behind the naming of two types 
of errors. 

(b) In the legitimate-vs-spam email classification, which type of errors do we 
need to minimize further? Also explain why. 

(c) State the definitions of hyperparameter and validation set. 

(d) What are the meanings of tomos and graphy? Describe the idea of Computed 
Tomography (CT). 

(e) In the least squares problem for CT, what is the size of the matrix A? What 
is the condition on the size? Can the condition easily be satisfied? Also 
explain why. 


Prob 3.6 (Regularized CT) Consider the LS problem that we formulated for
CT in Section 1.10:

$$\min \|Ad - b\|^2 \tag{1.154}$$

where $d \in \mathbb{R}^n$ indicates the density vector, $A \in \mathbb{R}^{m \times n}$ denotes the measurement
matrix, and $b \in \mathbb{R}^m$ is the measurement vector. Here $m$ denotes the number of
X-ray beams projected onto an object of interest, and $n$ indicates the number of grids
that span the object. We assume that $d$ is a vectorized version of a matrix which
discretizes the object with $n_1$ horizontal grids and $n_2$ vertical grids. See Fig. 1.28
for such vectorization. In the example, $n_1 = 50$, $n_2 = 50$ and $n = n_1 n_2 = 2500$.

In practice, people use a slightly different version of the ordinary LS problem
(1.154), in an effort to exploit some prior information on $d$. The density vector
$d$ is a conversion of image information. One of the common properties of natural
images is that they are smooth; neighboring pixel values are not very different
from each other. One way to enforce the smoothness is to minimize the differences
across horizontally-neighboring pixel values, i.e., minimize:

$$\begin{aligned}
&\|d_1 - d_2\|^2 + \|d_2 - d_3\|^2 + \cdots + \|d_{n_1-1} - d_{n_1}\|^2 \\
&+ \|d_{n_1+1} - d_{n_1+2}\|^2 + \|d_{n_1+2} - d_{n_1+3}\|^2 + \cdots + \|d_{2n_1-1} - d_{2n_1}\|^2 \\
&+ \cdots \\
&+ \|d_{n_1(n_2-1)+1} - d_{n_1(n_2-1)+2}\|^2 + \cdots + \|d_{n_1 n_2-1} - d_{n_1 n_2}\|^2.
\end{aligned} \tag{1.155}$$

Figure 1.28. Vectorization of a discretized matrix for a 2-dimensional object.


(a) Suppose we want to express (1.155) into a matrix-vector form as follows:

$$(1.155) = \|D_h d\|^2. \tag{1.156}$$

What is $D_h$?

(b) Another way to enforce the smoothness is to make sure that the differences
across vertically-neighboring pixel values are small, i.e., minimize the following quantity:

$$\begin{aligned}
&\|d_1 - d_{n_1+1}\|^2 + \|d_{n_1+1} - d_{2n_1+1}\|^2 + \cdots + \|d_{n_1(n_2-2)+1} - d_{n_1(n_2-1)+1}\|^2 \\
&+ \|d_2 - d_{n_1+2}\|^2 + \|d_{n_1+2} - d_{2n_1+2}\|^2 + \cdots + \|d_{n_1(n_2-2)+2} - d_{n_1(n_2-1)+2}\|^2 \\
&+ \cdots \\
&+ \|d_{n_1} - d_{2n_1}\|^2 + \|d_{2n_1} - d_{3n_1}\|^2 + \cdots + \|d_{n_1(n_2-1)} - d_{n_1 n_2}\|^2.
\end{aligned} \tag{1.157}$$

Again we want to express (1.157) into a matrix-vector form like:

$$(1.157) = \|D_v d\|^2. \tag{1.158}$$

What is $D_v$?

(c) As we did before, we want to formulate a regularized LS problem which
minimizes $\|Ad - b\|^2$ and $\|D_h d\|^2 + \|D_v d\|^2$ at the same time. Suppose
we employ a regularization factor $\lambda$ which multiplies $\|D_h d\|^2 + \|D_v d\|^2$.
Formulate such a regularized LS problem; your final form should
follow the standard form of an LS.


Prob 3.7 (Implementation of regularized CT) We implement the regularized
CT formulated in Prob 3.6. Suppose we project many X-ray beams onto a
2-dimensional image of interest from different angles, as illustrated in Fig. 1.29.

Let $m$ be the number of X-ray beams projected. We discretize the 2D image with
$n_1$ horizontal and $n_2$ vertical grids so that the total number $n$ of grids is $n_1 n_2$. We
apply this method to obtain a measurement matrix $A \in \mathbb{R}^{m \times n}$ and a measurement
vector $b \in \mathbb{R}^m$. This data is given in the files: (i) “CT_data1.csv” (for image 1); and
(ii) “CT_data2.csv” (for image 2). These are uploaded on:


http://csuh.kaist.ac.kr/convex_book/PS3/CT_data1.csv
http://csuh.kaist.ac.kr/convex_book/PS3/CT_data2.csv


Figure 1.29. Projecting many X-ray beams onto a 2D image of interest from different
angles.

(a) What are $m$, $n_1$, $n_2$ and $n$?

(b) For image 1, consider the regularized least squares problem formulated in
Prob 3.6. For the regularization factor $\lambda \in \{0, 1, 100\}$, solve the problem
to estimate the density vector $d$. Also visualize it. For visualization, you may
want to use the following Python script:

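import numpy as np
import matplotlib.pyplot as plt

# reshape the estimated density vector into an n_1-by-n_2 image
Img = np.reshape(dhat,(n_1,n_2))
# rotate and flip so that the image is displayed in the intended orientation
Img = np.rot90(Img,3)
Img = np.fliplr(Img)
plt.imshow(Img, 'gray')
plt.show()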


where dhat indicates the estimated density vector. Also write a script for 
CVXPY implementation. 
(c) Repeat part (b) for image 2.


Prob 3.8 (CT) Let $d(x, y)$ be the density of an object at coordinate $(x, y)$. Suppose
we project an X-ray beam onto the object. Let $t$ be the length of the beam trajectory
measured from the starting point $(x_0, y_0)$ at $t = 0$. Let $\theta$ be the angle of the beam in reference
to the x-axis. Let $v$ be an additive noise in the intensity measurement. Express the
intensity measurement in terms of $d(\cdot,\cdot)$ and $v$. Then, explain the discretization
idea (that we learned in Section 1.10) to express an approximation of the intensity 
measurement. Moreover, use the idea of Computed Tomography (CT) to formulate 
a least-squares problem. Specify a condition on the size of the matrix (multiplied 
to the density vector) and discuss if the condition can be easily satisfied in reality. 


Prob 3.9 (True or False?) 


(a) Consider the regularized least squares classifier that we learned in
Section 1.9:

$$\min_{w \in \mathbb{R}^d} \|Aw - b\|^2 + \lambda \|w\|^2 \tag{1.159}$$

where $A \in \mathbb{R}^{m \times d}$, $m$ denotes the number of training samples, and $\lambda > 0$
indicates the regularization factor. There exists a sweet spot, say $\lambda^* > 0$,
such that the test error is minimized, and this can be found via test data in
practice.
(b) In real applications, the misdetection error is desired to be smaller than the
false positive error.
(c) The regularized CT can be formulated as an ordinary least-squares problem.

1.11 Quadratic Program


Recap During several past sections, we have studied two instances of convex opti- 
mization: LP and LS. For LP, we investigated a bunch of important examples that 
can be formulated as LPs or can be solved via LP relaxation. For LS, we explored 
two applications to demonstrate the power of the LS problem. 


Outline In this section, we are going to study the follow-up instance that includes
LP and LS as special cases: Quadratic Program (QP). Specifically we are going
to cover five topics. First we will study what QP is and show its convexity. We
will then verify that QP indeed subsumes LP and LS. Next, we will investigate 
a special case of QP which exhibits a closed-form solution. The focused prob- 
lem is Constrained Least Squares. In the fourth part, we will discuss how to deal 
with general QP yet in a very brief manner. Lastly we will investigate CVXPY 
implementation. 


Quadratic Program The standard form of QP reads:

$$\min\ w^T x + x^T Q x: \quad Ax - b \le 0, \quad Cx - e = 0 \tag{1.160}$$

where $Q = Q^T \in \mathbb{R}^{d \times d} \succeq 0$ is a positive semi-definite (PSD) matrix. We say
that a symmetric matrix, say $Q = Q^T \in \mathbb{R}^{d \times d}$, is positive semi-definite if
$v^T Q v \ge 0, \ \forall v \in \mathbb{R}^d$. Equivalently, one often uses the following condition instead: All the
eigenvalues of $Q$ are non-negative. One can prove the equivalence between the two
conditions, relying upon the eigenvalue decomposition (w.r.t. $Q$) together with
some manipulation. Please work through this yourself if you are not convinced. The PSD
condition is simply denoted by $Q \succeq 0$. Here $(w, A, b, C, e)$ are of compatible size.
Using the 2nd-order condition of convexity (check in Prob 3.4), one can readily 
show that QP is indeed a convex optimization problem: 


$$\begin{aligned}
\nabla^2 (w^T x + x^T Q x) &= \nabla (w + (Q + Q^T) x) \\
&= \nabla (w + 2Qx) \\
&= 2Q \succeq 0
\end{aligned}$$

where the 1st and 3rd equalities are due to the definition of the gradient w.r.t. a vector
$x \in \mathbb{R}^d$ (please check Prob 1.2); and the 2nd equality and the last inequality come from
our hypothesis ($Q^T = Q \succeq 0$).

Obviously QP includes LP as a special case in which $Q = 0$. To check whether
it subsumes LS, consider:

$$\min_{x \in \mathbb{R}^d} \|Ax - b\|^2 = \min_{x \in \mathbb{R}^d} x^T (A^T A) x - 2 b^T A x + b^T b. \tag{1.161}$$

First observe that $b^T b$ does not alter the optimal solution. The second thing to notice is
that $A^T A$ is PSD. Why? Notice that

$$x^T A^T A x = (Ax)^T (Ax) = \|Ax\|^2 \ge 0 \quad \forall x \in \mathbb{R}^d. \tag{1.162}$$
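
The identification of LS as a QP in (1.161) and (1.162) can be checked numerically. The following is a minimal sketch on made-up random data (the sizes, the seed and the names `ls_value`, `qp_value` are illustrative assumptions, not part of the text): it reads off $Q = A^T A$ and $w = -2A^T b$, confirms that $Q$ has non-negative eigenvalues, and confirms that the LS and QP objectives agree at an arbitrary point.

```python
import numpy as np

rng = np.random.default_rng(0)    # fixed seed; the data are made up for illustration
m, d = 8, 3
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)

# QP data read off from (1.161): quadratic part Q = A^T A, linear part w = -2 A^T b
Q = A.T @ A
w = -2 * A.T @ b
const = b @ b                     # constant term b^T b; it does not affect the minimizer

# (1.162): Q is symmetric with non-negative eigenvalues, i.e., PSD
eigvals = np.linalg.eigvalsh(Q)
is_psd = bool(np.all(eigvals >= -1e-10))

# the LS objective and the QP objective coincide at an arbitrary point x
x = rng.standard_normal(d)
ls_value = np.linalg.norm(A @ x - b) ** 2
qp_value = x @ Q @ x + w @ x + const
print(is_psd, np.isclose(ls_value, qp_value))
```

The small negative tolerance in the eigenvalue test only guards against floating-point round-off.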


Equality-constrained LS As mentioned in the beginning, there is a special (yet
important) case in which a closed-form solution exists. The special case is very
similar to the ordinary LS; the only distinction is that we additionally include an
equality constraint. Let us call such a problem equality-constrained LS:

$$\min_{x \in \mathbb{R}^d} \|Ax - b\|^2: \quad Cx - e = 0 \tag{1.163}$$

where $A \in \mathbb{R}^{m \times d}$, $b \in \mathbb{R}^m$, $C \in \mathbb{R}^{p \times d}$ and $e \in \mathbb{R}^p$.

Obviously we are interested in the case of $m \ge d$. Why? Remember what we
discussed in Section 1.8. Depending on the values of $p$ and $d$, we can now think of
two cases: (i) $p > d$; and (ii) $p \le d$. The first is not an interesting case because in
that case $x^*$ is determined solely by the equality constraint (so it has nothing
to do with minimizing the squared error) or there is no solution. Hence, the second
case $p \le d$ is of interest. Regarding the wide (or square) matrix $C$, assume that

$$\text{rank}(C) = p. \tag{1.164}$$

We also assume that

$$\text{rank}\left( \begin{bmatrix} A \\ C \end{bmatrix} \right) = d, \tag{1.165}$$

meaning that all the columns of $\begin{bmatrix} A \\ C \end{bmatrix}$ are linearly independent. Actually these often
hold in reality. Later you will see that the assumptions (1.164) and (1.165) are
instrumental in deriving the closed-form solution.


Closed-form solution Consider the case in which $m \ge d$, $p \le d$, and (1.164)
and (1.165) hold. In this case, the closed-form solution to (1.163) reads:

$$x^* = \text{d-Components}\left( \begin{bmatrix} 2A^T A & C^T \\ C & 0 \end{bmatrix}^{-1} \begin{bmatrix} 2A^T b \\ e \end{bmatrix} \right) \tag{1.166}$$

where d-Components($\cdot$) is an operator that takes the first $d$ components of ($\cdot$).
Notice that the inside of the operator is a $(d + p)$-dimensional vector.
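
The formula (1.166) can be tried out directly in NumPy. The following is a minimal sketch on made-up random data (the sizes, the seed, and the names `kkt`, `rhs`, `x_star` are illustrative assumptions): it builds the block matrix, solves the linear system instead of forming the inverse explicitly, keeps the first $d$ components, and then compares the resulting cost against randomly drawn points that also satisfy $Cx = e$.

```python
import numpy as np

rng = np.random.default_rng(1)    # fixed seed; the data are made up for illustration
m, d, p = 10, 4, 2
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)
C = rng.standard_normal((p, d))
e = rng.standard_normal(p)

# block matrix and right-hand side appearing in (1.166)
kkt = np.block([[2 * A.T @ A, C.T],
                [C, np.zeros((p, p))]])
rhs = np.concatenate([2 * A.T @ b, e])

sol = np.linalg.solve(kkt, rhs)   # (d + p)-dimensional vector
x_star = sol[:d]                  # d-Components(.): keep the first d entries

# x_star satisfies the equality constraint
feasible = bool(np.allclose(C @ x_star, e))

# other feasible points are x_star + N v, where the columns of N span {v : C v = 0}
_, _, Vt = np.linalg.svd(C)
N = Vt[p:].T                      # d x (d - p) null-space basis (C has full row rank here)
best = np.linalg.norm(A @ x_star - b) ** 2
others = [np.linalg.norm(A @ (x_star + N @ rng.standard_normal(d - p)) - b) ** 2
          for _ in range(1000)]
print(feasible, best <= min(others) + 1e-9)
```

Random sampling of feasible points is of course not a proof of optimality; it only illustrates the claim that is proved in the text.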

In fact, once we start with the given formula (1.166) for x*, the proof is straight- 
forward, although directly deriving the formula is highly non-trivial. Here we will 
take an easy way, formally stated below. 


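1. Show that if $\exists (x^*, z) \in \mathbb{R}^{d+p}$ such that

$$\begin{bmatrix} 2A^T A & C^T \\ C & 0 \end{bmatrix} \begin{bmatrix} x^* \\ z \end{bmatrix} = \begin{bmatrix} 2A^T b \\ e \end{bmatrix}, \tag{1.167}$$

then $x^*$ must be the optimal solution, i.e., $\|Ax - b\|^2 \ge \|Ax^* - b\|^2$, $\forall x$
subject to $Cx - e = 0$.

2. Show that

$$\begin{bmatrix} 2A^T A & C^T \\ C & 0 \end{bmatrix} \text{ is invertible due to (1.164) and (1.165).} \tag{1.168}$$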


A side note: The direct derivation of the formula (1.166) requires the identification
of the optimality condition for $x^*$ (which is not straightforward to derive) as
well as some concepts (e.g., Lagrange functions) that we have not studied yet. Lagrange
functions will be covered in Part II. Hence, this derivation is omitted here; the
detailed derivation will be given in Section 2.2. You will also have a chance to
get some sense of the direct derivation in Prob 4.2.


Remark Prior to proving (1.166), let us say a few words about the matrix of
interest that appears in (1.166):

$$\begin{bmatrix} 2A^T A & C^T \\ C & 0 \end{bmatrix}. \tag{1.169}$$


This is a very famous matrix, named the KKT matrix. Why do people name it
KKT? Because it is the key matrix recognized by three prominent scholars: Karush,
Kuhn and Tucker. Kuhn is famous as a friend of John Nash, who received the Nobel
Prize in economics for his work on game theory and is the model for the main character in
the movie A Beautiful Mind. Tucker is famous as the PhD advisor of John Nash. In
fact, they recognized the matrix in the process of deriving necessary conditions for
optimality of general optimization problems, named the KKT conditions, which you may
have heard of. The KKT conditions are very important conditions that form the basis
of strong duality that we will study in Part II. So we will discuss more on this
later.

A side note: The KKT conditions were publicized in a conference paper by
Kuhn and Tucker in 1951 (Kuhn and Tucker, 2014). But later it was revealed
that the same conditions were already derived in the master's thesis by Karush in
1939 (Karush, 1939).


Proof: (1.167) $\Rightarrow \|Ax - b\|^2 \ge \|Ax^* - b\|^2$, $\forall x: Cx - e = 0$ As mentioned earlier,
the proof is straightforward. Consider

$$\begin{aligned}
\|Ax - b\|^2 &= \|(Ax - Ax^*) + (Ax^* - b)\|^2 \\
&= \|Ax - Ax^*\|^2 + \|Ax^* - b\|^2 + 2(Ax - Ax^*)^T (Ax^* - b) \\
&\overset{(a)}{=} \|Ax - Ax^*\|^2 + \|Ax^* - b\|^2 \\
&\ge \|Ax^* - b\|^2.
\end{aligned}$$

The only thing that remains to complete the proof is to show the step (a) in the
above. See below for the proof:

$$\begin{aligned}
2(Ax - Ax^*)^T (Ax^* - b) &= 2(x - x^*)^T A^T (Ax^* - b) \\
&\overset{(b)}{=} -(x - x^*)^T C^T z \\
&= -(Cx - Cx^*)^T z \\
&\overset{(c)}{=} -(e - e)^T z = 0
\end{aligned}$$

where (b) comes from the fact that $2A^T A x^* - 2A^T b = -C^T z$, which is the first
component in (1.167); and (c) is due to $Cx^* = e$, which is the second component
in (1.167), together with the feasibility of $x$, i.e., $Cx = e$.


Proof of (1.168) The proof idea is by contradiction. Suppose

$$\begin{bmatrix} 2A^T A & C^T \\ C & 0 \end{bmatrix} \text{ is not invertible.} \tag{1.170}$$

Here not being invertible means that some column of the matrix in (1.170) can be
expressed as a linear combination of the other columns. This implies
that

$$\exists \begin{bmatrix} x \\ z \end{bmatrix} \neq 0 \ \text{ such that } \ \begin{bmatrix} 2A^T A & C^T \\ C & 0 \end{bmatrix} \begin{bmatrix} x \\ z \end{bmatrix} = 0. \tag{1.171}$$

This then gives:

$$2A^T A x + C^T z = 0. \tag{1.172}$$

Multiplying both sides by $x^T$ from the left, we get:

$$2x^T A^T A x + x^T C^T z = 0. \tag{1.173}$$

Since $Cx = 0$ due to (1.171), $x^T C^T = 0$. Applying this to (1.173), we get:
$\|Ax\|^2 = 0$, which then yields: $Ax = 0$. This together with $Cx = 0$ gives:

$$\begin{bmatrix} A \\ C \end{bmatrix} x = 0. \tag{1.174}$$

Recall one of our assumptions made earlier (1.165). This implies that all the
columns of $\begin{bmatrix} A \\ C \end{bmatrix}$ are linearly independent. So (1.174) must imply that

$$x = 0. \tag{1.175}$$

Applying this to (1.172), we get:

$$C^T z = 0. \tag{1.176}$$

Again recall the other assumption made earlier (1.164). This implies that all the
rows of $C$ (i.e., all the columns of $C^T$) are linearly independent. Hence,

$$z = 0. \tag{1.177}$$

This together with (1.175) contradicts (1.171), thus completing the
proof of (1.168).
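
The role of the assumption (1.165) in the invertibility argument can be illustrated numerically. The following is a minimal sketch on made-up data (the helper `kkt_matrix` and the names `A_good`, `A_bad` are hypothetical, introduced only for this illustration): for generic $A$ and $C$ the KKT matrix has full rank $d + p$, whereas forcing $A$ and $C$ to share a null vector $v$ makes $[v; 0]$ a null vector of the KKT matrix, so its rank drops.

```python
import numpy as np

def kkt_matrix(A, C):
    """Assemble the block matrix [[2 A^T A, C^T], [C, 0]] from (1.169)."""
    p = C.shape[0]
    return np.block([[2 * A.T @ A, C.T],
                     [C, np.zeros((p, p))]])

rng = np.random.default_rng(2)    # fixed seed; the data are made up for illustration
d, p = 4, 2
A_good = rng.standard_normal((6, d))
C = rng.standard_normal((p, d))

# generic case: (1.164) and (1.165) hold, so the KKT matrix is invertible
rank_good = np.linalg.matrix_rank(kkt_matrix(A_good, C))

# break (1.165): take a unit vector v with C v = 0 and remove the component of
# every row of A along v, so that A_bad v = 0 as well
_, _, Vt = np.linalg.svd(C)
v = Vt[-1]
A_bad = A_good - np.outer(A_good @ v, v)
rank_bad = np.linalg.matrix_rank(kkt_matrix(A_bad, C))

print(rank_good, rank_bad, d + p)
```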


General QP Recall the general QP taking the following standard form:

$$\min\ w^T x + x^T Q x: \quad Ax - b \le 0, \quad Cx - e = 0 \tag{1.178}$$

where $Q = Q^T \succeq 0$ is a PSD matrix. Now how to solve the general QP? Unfortunately,
there is no closed-form solution in general. As mentioned in Section 1.3,
strong duality provides algorithmic insights. So we will study how to solve the problem
later when dealing with strong duality in Part II.


CVXPY implementation While we will cover the algorithm in Part II, here we
investigate how to use CVXPY for solving QP. CVXPY implementation depends
highly on the standard form (1.178). So the key procedure includes: (i) constructing
the matrices and vectors of interest (i.e., $Q, w, A, b, C, e$); and (ii) formulating a
problem object using QP-tailored built-in functions properly. For illustrative
purposes, we consider a simple example:

$$Q = \begin{bmatrix} 26 & 3 & 10 \\ 3 & 10 & 1 \\ 10 & 1 & 6 \end{bmatrix}, \quad w = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}, \quad A = \begin{bmatrix} 1 & -4 & 3 \end{bmatrix}, \quad b = 10, \quad C = \begin{bmatrix} 2 & -1 & 2 \end{bmatrix}, \quad e = 3.$$

See below a code for implementation:


import cvxpy as cp
import numpy as np

# optimization variable
x = cp.Variable(3)

# construct (Q,w,A,b,C,e)
Q = np.array([[26, 3, 10],
              [3, 10, 1],
              [10, 1, 6]])
w = np.array([1,3,2])
A = np.array([[1, -4, 3]])
b = np.array([10])
C = np.array([[2, -1, 2]])
e = np.array([3])

# objective function
cost = cp.quad_form(x, Q) + w.T @ x
obj_min = cp.Minimize(cost)

# constraints
constraints = [A @ x <= b, C @ x == e]

# set up a problem
prob = cp.Problem(obj_min,constraints)

# solve the problem
prob.solve()

# print the solution
print('status: ', prob.status)
print('optimal value: ', prob.value)
print('optimal x*: ', x.value)


status: optimal 
optimal value: 10.573694029850747 
optimal x*: [-0.27425373 -0.55223881 1.49813433]
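
The reported optimum can be cross-checked without a solver. The following is a minimal sketch that is not how CVXPY solves the problem: it assumes that the inequality constraint is inactive at the optimum, drops it, and solves the remaining equality-constrained QP through the linear system $2Qx + w + C^T \nu = 0$, $Cx = e$ (the multiplier name `nu` and the variable `x_eq` are introduced here for illustration). The assumption is then verified by checking $Ax \le b$ at the computed point, which certifies that the point is also optimal for the full problem.

```python
import numpy as np

Q = np.array([[26, 3, 10], [3, 10, 1], [10, 1, 6]], dtype=float)
w = np.array([1, 3, 2], dtype=float)
A = np.array([[1, -4, 3]], dtype=float)
b = np.array([10.0])
C = np.array([[2, -1, 2]], dtype=float)
e = np.array([3.0])

# stationarity and feasibility of the equality-constrained problem:
#   2 Q x + w + C^T nu = 0,   C x = e
kkt = np.block([[2 * Q, C.T],
                [C, np.zeros((1, 1))]])
rhs = np.concatenate([-w, e])
sol = np.linalg.solve(kkt, rhs)
x_eq, nu = sol[:3], sol[3:]

value = x_eq @ Q @ x_eq + w @ x_eq
# if the dropped inequality holds at x_eq, then x_eq also solves the full QP
ineq_ok = bool(np.all(A @ x_eq <= b))
print(ineq_ok, value, x_eq)
```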


Here we use a built-in function, cp.quad_form(x,Q), to compute $x^T Q x$, and w.T @ x
indicates $w^T x$. The syntax for constraints is also simple.

[Figure 1.30: diagram of nested problem classes; legible labels: Convex Optimization; Semi-Definite Program (SDP), 1994; Second-Order Cone Program (SOCP), 1994.]

Figure 1.30. Hierarchy of convex optimization problems.


Look ahead So far, we have studied three instances of convex optimization: LP,
LS and QP. In the next section, we will embark on the follow-up instance:
Second-Order Cone Program (SOCP). Recall the hierarchy of convex optimization
problems in Fig. 1.30.

1.12 Second-order Cone Program


Recap So far we have studied three instances of convex optimization: LP, LS 
and QP. In the last section, we studied QP, investigating a special case of QP in 
which there is a closed-form solution: the equality-constrained LS. In particular we 
emphasized that the KKT matrix that appears in the closed-form solution is one of 
the components that arises in the KKT conditions that we will study in depth in 
Part II. 


Outline In this section, we are going to study a follow-up instance that includes 
LP, LS and QP as special cases: Second-Order Cone Program, SOCP for short. 
We will proceed in four steps. First we will study what SOCP is together
with the verification of the convexity of the problem. We will then demonstrate 
that it subsumes LP and QP. Next, we will discuss applications in which SOCP 
plays a role and therefore one can see the efficacy of the problem. Lastly, we will 
investigate CVXPY implementation. 


Second-Order Cone Program (SOCP) The standard form of SOCP is as
follows:

$$\min_{x \in \mathbb{R}^d} w^T x: \quad \|A_i x - b_i\| \le c_i^T x + e_i, \ i \in \{1, \ldots, m\}, \quad Fx = g \tag{1.179}$$

where $A_i \in \mathbb{R}^{k_i \times d}$, $b_i \in \mathbb{R}^{k_i}$, $c_i \in \mathbb{R}^d$, $e_i \in \mathbb{R}$, $F \in \mathbb{R}^{p \times d}$, and $g \in \mathbb{R}^p$. The
operator $\|\cdot\|$ indicates the Euclidean norm. By convention, people use the expression
$c_i^T x + e_i$ instead of $c_i^T x - e_i$ (which would be consistent with $A_i x - b_i$). Here the
complicated-looking inequality constraint is one that you have not seen before. Let
us first verify that the problem belongs to convex optimization. To this end, we
need to show that the left-hand-side function in the standard form (in which the
right-hand side is 0 with the “$\le$” type inequality) is convex:

$$\|A_i x - b_i\| - c_i^T x - e_i.$$


Notice that the latter term $-c_i^T x - e_i$ in the above is affine and also the argument
of the Euclidean norm is affine. Since convexity is preserved under addition and affine
transformation, it suffices to show that $\|x\|$ is convex. In the one-dimensional case,
the function $\|x\|$ is “V”-shaped, so it is convex. It turns out that this is also the case for an
arbitrary dimension. Please check this by yourself.

[Figure 1.31: cone plot with axes $x_1$ and $x_2$.]

Figure 1.31. Illustration of the second-order cone $\{(x, t) \in \mathbb{R}^{d+1} : \|x\| \le t\}$ when $d = 2$.
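
The convexity of the Euclidean norm in arbitrary dimension, left above as a check, follows from the triangle inequality together with the absolute homogeneity of the norm. A short sketch, where the points $u, v \in \mathbb{R}^d$ and the mixing weight $\alpha \in [0, 1]$ are auxiliary symbols introduced only for this derivation:

```latex
\begin{aligned}
\|\alpha u + (1-\alpha) v\|
  &\le \|\alpha u\| + \|(1-\alpha) v\|
     && \text{(triangle inequality)} \\
  &= \alpha \|u\| + (1-\alpha) \|v\|
     && \text{(since } \alpha \ge 0 \text{ and } 1-\alpha \ge 0\text{)}
\end{aligned}
\qquad \forall u, v \in \mathbb{R}^d, \ \forall \alpha \in [0,1].
```

This is exactly the defining inequality of a convex function, and no step depends on the dimension $d$.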


Why do we call it SOCP? As you may guess, the naming comes from the never-seen
inequality constraint:

$$\|A_i x - b_i\| \le c_i^T x + e_i. \tag{1.180}$$

To see the rationale behind the naming, let us consider a very simple setting which
can give a hint: $A_i = I$, $b_i = 0$, $c_i = 0$, $e_i = t$. In this case, the constraint is
simplified as: $\|x\| \le t$. Now consider the set of $(x, t) \in \mathbb{R}^{d+1}$ subject to the
constraint (1.180):

$$\mathcal{C} := \{(x, t) \in \mathbb{R}^{d+1} : \|x\| \le t\}. \tag{1.181}$$

Take a look at the shape of the set, illustrated in Fig. 1.31. It looks like an ice-cream
cone. Also the norm that appears in the set is the Euclidean norm, which is the $\ell_2$
norm. Hence, it is called the second-order cone (SOC). Other names are quadratic
cone, ice-cream cone or Lorentz cone.

Since the constraint in (1.181) is a special case of the original constraint (1.180),
you may still wonder why the problem (1.179) is called SOCP. Here a key observation
is that an affine transformation of $x$ is required to lie in an SOC:

$$(A_i x - b_i, \ c_i^T x + e_i) \in \mathcal{C}.$$

Since convexity is preserved under affine transformation, one can also view the original
constraint (1.180) as an SOC up to affine transformation. So one can interpret the
problem (1.179) as an SOC-constraint-based Program, which can be simply called
SOCP.


Subsumes QP: QP → SOCP Let us show that the problem (1.179) includes LP
and QP as special cases. One can immediately see the inclusion of LP by setting:

$$A_i = 0 \quad \forall i \in \{1, \ldots, m\}$$

in the original problem.

The proof of the inclusion of QP is slightly involved. To this end, we will show
that any QP can be cast into the form of SOCP. So let us start with the standard
form of QP:

$$\min\ w^T x + x^T Q x: \quad Ax - b \le 0, \quad Cx - e = 0 \tag{1.182}$$

where $Q \in \mathbb{R}^{d \times d} \succeq 0$, $A \in \mathbb{R}^{m \times d}$ and $C \in \mathbb{R}^{p \times d}$. Here what is annoying is
the quadratic term $x^T Q x$ that appears in the objective function. In an effort to
translate the annoying term into an affine term, let us first manipulate the matrix
$Q$ via eigenvalue decomposition (EVD). Since $Q$ is symmetric, one can apply EVD
to obtain:

$$Q = U \Lambda U^T$$

where $U \in \mathbb{R}^{d \times d}$ is a unitary matrix (i.e., $U^T U = I$) and $\Lambda$ is a diagonal matrix:
$\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_d)$ where $\lambda_i$ indicates the $i$th eigenvalue of $Q$. Now we define
$y = Q^{1/2} x$ where

$$Q^{1/2} = U \Lambda^{1/2} U^T$$

where $\Lambda^{1/2} := \text{diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_d})$. This definition is valid because the $\lambda_i$'s are
non-negative due to $Q \succeq 0$. This then yields: $y^T y = x^T Q x$. Introducing this new
variable into (1.182), we get:

$$\min_{x, y}\ w^T x + y^T y: \quad Ax - b \le 0, \quad Cx - e = 0, \quad y = Q^{1/2} x. \tag{1.183}$$

While the newly introduced constraint $y = Q^{1/2} x$ is okay as it is affine, the
quadratic term $y^T y$ in the objective is still problematic. To translate this into an
affine term, we introduce a new variable, say $t$, such that

$$t \ge y^T y. \tag{1.184}$$

Here the key observation is that minimizing $t$ is equivalent to minimizing $y^T y$.
Why? By minimizing $t$, one pushes down the upper bound on $y^T y$, hence one can
reduce $y^T y$ up to the limit. Similarly by minimizing $y^T y$, we can choose $t$ (our
optimization variable) so that it exactly touches the minimized $y^T y$. Therefore,
we can replace $y^T y$ in the objective with $t$ while adding the constraint (1.184), thus
obtaining:

$$\min_{x, y, t}\ w^T x + t: \quad Ax - b \le 0, \quad Cx - e = 0, \quad y = Q^{1/2} x, \quad y^T y \le t. \tag{1.185}$$


Is $y^T y \le t$ an SOC constraint? Is the newly introduced constraint $y^T y \le t$ an
SOC constraint? At first glance, it does not look like one, as it can be represented as: $\|y\| \le \sqrt{t}$.
Notice that $\sqrt{t}$ is not affine in $t$. But it turns out it can be cast as one with some
modification. To see this, let us first manipulate it as: $4\|y\|^2 \le 4t$. An interesting
trick is to represent $4t$ as $(t + 1)^2 - (t - 1)^2$. This then yields:

$$4\|y\|^2 + (t - 1)^2 \le (t + 1)^2.$$
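
The rewriting of $y^T y \le t$ can be sanity-checked numerically. The following is a minimal sketch on randomly sampled pairs $(y, t)$ (the sampling ranges, the seed and the variable names are illustrative assumptions): for each sample it evaluates the original constraint, the squared form obtained from $4t = (t+1)^2 - (t-1)^2$, and the corresponding norm form $\|[2y;\ t-1]\| \le t + 1$, and records whether all three give the same answer. Samples lying numerically on the boundary are skipped to avoid round-off ambiguity.

```python
import numpy as np

rng = np.random.default_rng(4)    # fixed seed; samples are made up for illustration
agree = True
for _ in range(10000):
    y = rng.standard_normal(3)
    t = rng.uniform(-1.0, 5.0)    # negative values of t are included on purpose
    if abs(y @ y - t) < 1e-9:     # skip points numerically on the boundary
        continue
    original = bool(y @ y <= t)
    squared = bool(4 * (y @ y) + (t - 1) ** 2 <= (t + 1) ** 2)
    stacked = np.concatenate([2 * y, [t - 1]])
    norm_form = bool(np.linalg.norm(stacked) <= t + 1)
    agree = agree and (original == squared) and (squared == norm_form)
print(agree)
```

Note that for $t < 0$ all three forms are violated, which is consistent with $y^T y \ge 0$.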


Observe in the above that we have square exponents in every term. What does this
remind you of? It reminds you of Gauss's trick that we have seen multiple times:
representing the sum of squares as the Euclidean norm of a stacked vector. In other
words, we employ the trick to obtain:

$$\left\| \begin{bmatrix} 2y \\ t - 1 \end{bmatrix} \right\|^2 \le (t + 1)^2.$$

Dropping the square exponents on both sides and then using the definition (1.181)
of SOC (together with the non-negativity of $t$), we see that the following affine
transformation of the variables lies in an SOC:

$$\left( \begin{bmatrix} 2y \\ t - 1 \end{bmatrix}, \ t + 1 \right) \in \mathcal{C}.$$

Hence, we obtain the following SOCP (from QP):

$$\min_{x, y, t}\ w^T x + t: \quad Ax - b \le 0, \quad Cx - e = 0, \quad y = Q^{1/2} x \ \text{(affine)}, \quad \left\| \begin{bmatrix} 2y \\ t - 1 \end{bmatrix} \right\| \le t + 1 \ \text{(SOC)}. \tag{1.186}$$


Applications You may wonder why we care about SOCP. One obvious reason is


that it has many applications like LP and LS. In particular, here we highlight two 
specific settings in which the problem is instrumental. 

The first setting represents practically-relevant scenarios in which there is
uncertainty in data and/or parameters. For example, in the legitimate-vs-spam email
classification, data points can be viewed as random quantities. It turns out that
taking this probabilistic aspect into account, one can modify the original LP (that
we formulated in Section 1.5) into an SOCP. In fact, such a modified LP is
categorized into a broader class of LPs, called robust LP, which covers all the
probabilistic variations of LPs (Ben-Tal and Nemirovski, 1998; El Ghaoui and Lebret,
1997).

Another example is the LS problem with an uncertain matrix A. For instance, 
in the CT application that we investigated in Section 1.10, the matrix A contains 
the length information of a beam trajectory that traverses small grids. Since the 
length is a measured quantity, it may contain some measurement noise, thus incur- 
ring some uncertainty. Taking this aspect, one can modify the original LS into an 
SOCP. 

The second setting concerns scenarios in which optimization problems are
formulated with Euclidean norms. Examples include: (i) distance-minimizing location
planning in which one wants to locate a warehouse so as to serve many service
locations while minimizing the transportation cost, which is usually proportional to the
Euclidean distance (Drezner and Hamacher, 2004); (ii) image denoising wherein
the task is to remove the noise effect on the edges of an image while incorporating a
regularization term which involves a Euclidean norm; and (iii) penalized LS
in which one wants to minimize a noise effect while adding a Euclidean-norm-associated
term (in the objective function) for the purpose of penalizing the noise
effect.

Here we will not cover all of the above applications, in the interest of focus.
Instead we are going to cover one application: robust LP.


An example of robust LP: Chance Program (CP) (Geletu et al., 2013)
The application that we would like to put an emphasis on is a prominent example
of robust LP, named Chance Program (CP). For illustrative purposes, let us consider
a simple LP containing only one inequality constraint:

$$\min\ w^T x: \quad a^T x \le b \tag{1.187}$$


where $w, a \in \mathbb{R}^d$ and $b \in \mathbb{R}$. In the legitimate-vs-spam email classification,
$a$ indicates a data point. As mentioned earlier, such a data point
can be viewed as a random quantity. So in this case, $a$ can be modeled as a random
vector.

In an effort to deal with uncertainty, one may want to instead consider a
probabilistic constraint that can be stated as:

$$P(a^T x \le b) \ge 1 - \epsilon \tag{1.188}$$

for some small $\epsilon > 0$. Actually one can interpret the margin-based linear classifier as
another probabilistic approach, since outliers are dealt with via margins. The approach
considered herein is a more direct approach that employs the probability directly in
the constraint. The inequality (1.188) means that the constraint holds with high
probability, i.e., with probability at least $1 - \epsilon$. Now the question is: How to compute
$P(a^T x \le b)$?


Gaussian approximation for $P(a^T x \le b)$ In fact, the exact computation is
almost impossible in reality as we have no idea of the probability distribution that
the vector $a$ follows. One way to handle this is to instead approximate the
computation, assuming that the vector $a$ follows a well-known distribution for which the
probability calculation is tractable. One such well-known distribution is the
Gaussian distribution. The Gaussian distribution is not only computationally tractable,
but it also represents many practical settings well.

So we will use the Gaussian distribution to approximate the probability
computation. Specifically assume that the random vector $a$ follows the Gaussian
distribution:

$$a \sim \mathcal{N}(\bar{a}, K) \tag{1.189}$$

where $\bar{a}$ indicates the mean $E[a]$ and $K$ denotes its covariance matrix, defined as
$K := E[(a - \bar{a})(a - \bar{a})^T]$. Here the symbol “$\sim$” means “is distributed according
to”, and $\mathcal{N}$ denotes the Gaussian (also called Normal) distribution.

Consider a linear combination of $a$, namely $a^T x$, which is of our interest. Under the
Gaussian assumption (1.189), $a^T x$ is also Gaussian:

$$a^T x \sim \mathcal{N}(\bar{a}^T x, \ x^T K x).$$

Figure 1.32. Cumulative distribution function (CDF) $\Phi(x)$ of the standard Gaussian distribution
$\mathcal{N}(0, 1)$.


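Why? Check in Prob 4.3. Using this, we can compute:

$$\begin{aligned}
P(a^T x \le b) &= P\left( \frac{a^T x - \bar{a}^T x}{\sqrt{x^T K x}} \le \frac{b - \bar{a}^T x}{\sqrt{x^T K x}} \right) \\
&= \Phi\left( \frac{b - \bar{a}^T x}{\sqrt{x^T K x}} \right)
\end{aligned} \tag{1.190}$$

where $\Phi(\cdot)$ indicates the cumulative distribution function (CDF) of the standard
Gaussian distribution (i.e., with mean zero and variance 1): $\Phi(x) := P(t \le x)$
where $t \sim \mathcal{N}(0, 1)$; also see Fig. 1.32 for the illustration of the CDF.

Applying (1.190) into (1.188), we get:

$$\Phi\left( \frac{b - \bar{a}^T x}{\sqrt{x^T K x}} \right) \ge 1 - \epsilon.$$

Since $\Phi(\cdot)$ is an increasing one-to-one function (again see Fig. 1.32),
we can invert the function to get:

$$\frac{b - \bar{a}^T x}{\sqrt{x^T K x}} \ge \Phi^{-1}(1 - \epsilon),$$

which in turn yields:

$$\Phi^{-1}(1 - \epsilon) \sqrt{x^T K x} \le b - \bar{a}^T x. \tag{1.191}$$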
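
The passage from the chance constraint (1.188) to the deterministic constraint (1.191) can be illustrated numerically. The following is a minimal sketch with made-up values for $\bar{a}$, $K$, $x$, $b$ and $\epsilon$ (all of them assumptions chosen only for illustration): it compares the closed-form probability in (1.190) with a Monte Carlo estimate obtained by sampling $a \sim \mathcal{N}(\bar{a}, K)$, and checks that (1.188) and (1.191) accept or reject the same $x$. The standard Gaussian CDF and its inverse are taken from Python's `statistics.NormalDist`.

```python
from statistics import NormalDist
import numpy as np

rng = np.random.default_rng(5)            # fixed seed; all numbers below are made up
a_bar = np.array([1.0, -2.0, 0.5])        # assumed mean of the random vector a
L = np.array([[1.0, 0.0, 0.0],
              [0.3, 0.8, 0.0],
              [-0.2, 0.1, 0.5]])
K = L @ L.T                               # assumed covariance (PSD by construction)
x = np.array([0.4, 0.1, -0.7])            # a fixed candidate decision vector
b, eps = 0.5, 0.1

mean = a_bar @ x                          # mean of a^T x
std = float(np.sqrt(x @ K @ x))           # standard deviation of a^T x
std_normal = NormalDist()                 # standard Gaussian N(0, 1)

# (1.190): closed-form probability versus a Monte Carlo estimate
p_formula = std_normal.cdf((b - mean) / std)
samples = rng.multivariate_normal(a_bar, K, size=200_000)
p_empirical = float(np.mean(samples @ x <= b))

# (1.188) and (1.191) should accept or reject x together
chance_ok = p_formula >= 1 - eps
deterministic_ok = std_normal.inv_cdf(1 - eps) * std <= b - mean
print(p_formula, p_empirical, chance_ok, deterministic_ok)
```

The Monte Carlo estimate agrees with the formula only up to sampling error, which shrinks as the number of samples grows.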
CP → SOCP Applying (1.191) to (1.187), we obtain:

$$\min_{x \in \mathbb{R}^d} w^T x: \quad \Phi^{-1}(1 - \epsilon) \sqrt{x^T K x} \le b - \bar{a}^T x. \tag{1.192}$$

Now the question is: Is the constraint in the above an SOC constraint? To figure this out, let
us simplify the constraint by letting $y := K^{1/2} x$. Since $K$ is PSD, one can define
$K^{1/2}$. We then get:

$$\|y\| \le \frac{b - \bar{a}^T x}{\Phi^{-1}(1 - \epsilon)}.$$

Here the direction of the inequality is preserved since $\Phi^{-1}(1 - \epsilon) > 0$ (due to $1 - \epsilon > 0.5$). Notice
that the following affine transformation of the optimization variables lies in an SOC:

$$\left( y, \ \frac{b - \bar{a}^T x}{\Phi^{-1}(1 - \epsilon)} \right) \in \mathcal{C}.$$

Hence, we get the following SOCP (from CP):

$$\min_{x, y} w^T x: \quad y = K^{1/2} x \ \text{(affine)}, \quad \|y\| \le \frac{b - \bar{a}^T x}{\Phi^{-1}(1 - \epsilon)} \ \text{(SOC)}. \tag{1.193}$$

A side note: One can use the same idea to cover a more general scenario in which
there are multiple data points. In such a general case, we end up with an
SOCP with multiple SOC constraints. Please check.


How to solve SOCP? Like QP, there is no closed-form solution for SOCP in 
general. So as mentioned earlier, we should rely on strong duality to gain algorith- 
mic insights. Hence, we will study it in depth in Part II. 


CVXPY implementation Lastly we present how to write a CVXPY script for solving SOCP with the standard form:

$$\min_{x \in \mathbb{R}^d} w^T x: \quad \|A_i x - b_i\| \le c_i^T x + e_i, \; i \in \{1, \dots, m\}, \quad Fx = g \qquad (1.194)$$

where $A_i \in \mathbb{R}^{k_i \times d}$, $b_i \in \mathbb{R}^{k_i}$, $c_i \in \mathbb{R}^d$, $e_i \in \mathbb{R}$, $F \in \mathbb{R}^{p \times d}$, and $g \in \mathbb{R}^p$. For instance, we consider a simple case in which $m = 1$ and the corresponding matrices and vectors read:


1 10 2 

w=|3|,4,;=] 3 5 
2 —] 1 
3 


a= |3 |,e = 4, F= [2 


Here is a code for implementation: 


import cvxpy as cp 
import numpy as np 


# optimization variable
x = cp.Variable(3)

# construct (w, A_1, b_1, c_1, e_1, F, g)
w = np.array([1, 3, 2])
A1 = np.array([[10, 2, -3],
[5-51]; 
EE i ZAID 

b1 = np.array([2, 4, 7])
c1 = np.array([3, 3, 1])
e1 = np.array([4])
F = np.array([[2, -1, 2]])
g = np.array([3])

# objective function
obj_min = cp.Minimize(w.T @ x)

# constraints: cp.SOC(t, z) encodes ||z|| <= t
soc_constraints = [cp.SOC(c1.T @ x + e1, A1 @ x - b1)]
constraints = soc_constraints + [F @ x == g]

# set up a problem
prob = cp.Problem(obj_min, constraints)

# solve the problem
prob.solve()

# print the solution
print('status: ', prob.status)
print('optimal value: ', prob.value)
print('optimal x*:', x.value)


status: optimal
optimal value: 0.5920529977190627
optimal x*: [ 0.55721217 -0.46268371 0.71144597]
Here we can easily construct the SOC constraint with the built-in function:
cp.SOC(c1.T @ x + e1, A1 @ x - b1).


Look ahead So far, we have studied four instances of convex optimization: LP, 
LS, QP and SOCP. We will embark on the final instance (from this book’s point 
of view) that subsumes all of the prior problems as special cases: Semi-Definite 
Program (SDP). 
1.13 Semi-definite Program (SDP)


Recap In the last section, we studied SOCP and figured out that the problem is 
instrumental in very practical contexts in which there is uncertainty in data and/or 
parameters. 


Outline In this section, we are going to study another instance that includes all 
of the prior problems as special cases: Semi-Definite Program (SDP). First we will 
study what SDP is and show that the feasible set in the problem is convex, thus 
proving the convexity of the problem. Next, we will demonstrate that it indeed 
subsumes LP, LS, QP and SOCP. Here we will put a particular emphasis on one 
key lemma that plays a crucial role in translating the prior problems into an SDP: 
Schur complement lemma (Zhang, 2006). We will also prove the lemma accordingly. 


Semi-definite program The standard form of SDP reads:

$$\min_{x \in \mathbb{R}^d} w^T x: \quad G + x_1 F_1 + \cdots + x_d F_d \succeq 0, \quad Cx = e \qquad (1.195)$$

where $w \in \mathbb{R}^d$, $C \in \mathbb{R}^{p \times d}$, $e \in \mathbb{R}^p$ and $G, F_i \in \mathbb{R}^{m \times m}$ are symmetric matrices. Here the inequality involves matrices that depend linearly on the components of $x := (x_1, x_2, \dots, x_d)$. Hence, it is called a Linear Matrix Inequality (LMI).


Proof of convexity To prove the convexity of SDP, we need to demonstrate that the following set induced by the inequality constraint is convex:

$$S := \{x : G + x_1 F_1 + \cdots + x_d F_d \succeq 0\}.$$

The proof is straightforward. Suppose $x, y \in S$. Fix $\lambda \in [0, 1]$. Let us check if the convex combination $\lambda x + (1 - \lambda) y$ is in the set $S$. So we consider:

$$\begin{aligned}
& G + (\lambda x_1 + (1 - \lambda) y_1) F_1 + \cdots + (\lambda x_d + (1 - \lambda) y_d) F_d \\
&= \lambda \underbrace{[G + x_1 F_1 + \cdots + x_d F_d]}_{\overset{(a)}{\succeq} 0} + (1 - \lambda) \underbrace{[G + y_1 F_1 + \cdots + y_d F_d]}_{\overset{(b)}{\succeq} 0} \\
&\overset{(c)}{\succeq} 0
\end{aligned}$$

where (a) and (b) come from the hypothesis $x, y \in S$; and (c) follows from the fact that a convex combination of two PSD matrices is also PSD. Why? Please see below for the proof. This implies that the convex combination is also in the set: $\lambda x + (1 - \lambda) y \in S$. Hence, this proves the convexity of $S$, thereby showing the convexity of the problem (1.195).

Proof of the step (c): For notational simplicity, denote the two matrices of interest by $A, B \in \mathbb{R}^{m \times m}$, with $A, B \succeq 0$. Let $v \in \mathbb{R}^m$. Multiplying the convex combination by $v^T$ on the left and by $v$ on the right, we get:

$$v^T (\lambda A + (1 - \lambda) B) v = \lambda v^T A v + (1 - \lambda) v^T B v \ge 0 \qquad (1.196)$$

where the inequality is due to the hypothesis $A, B \succeq 0$ and $\lambda \in [0, 1]$. This completes the proof.
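As a numerical illustration of step (c), the following sketch draws two random PSD matrices and checks that every convex combination on a grid of $\lambda$ values has non-negative eigenvalues. The matrices and the grid are arbitrary choices made for illustration; this is a sanity check, not a substitute for the proof.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_psd(n):
    """Return M M^T, which is PSD by construction."""
    M = rng.normal(size=(n, n))
    return M @ M.T

A, B = random_psd(4), random_psd(4)

# Smallest eigenvalue of lam * A + (1 - lam) * B for lam in [0, 1].
min_eigs = []
for lam in np.linspace(0.0, 1.0, 11):
    combo = lam * A + (1 - lam) * B
    min_eigs.append(np.linalg.eigvalsh(combo).min())

# A small negative tolerance absorbs floating-point rounding.
print("all combinations PSD:", min(min_eigs) >= -1e-10)
```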


Subsumes LP, LS and QP It is trivial to prove the inclusion of LP. Setting $G$ and the $F_i$'s as diagonal matrices in the problem (1.195), one can reduce the LMI to a set of linear inequalities, and hence the problem to an LP:

$$[G]_{ii} + [F_1]_{ii} x_1 + \cdots + [F_d]_{ii} x_d \ge 0, \quad i \in \{1, 2, \dots, m\} \qquad (1.197)$$

where $[G]_{ii}$ (or $[F_k]_{ii}$) indicates the $(i, i)$ entry of $G$ (or $F_k$), $k \in \{1, 2, \dots, d\}$. Here we use the fact that a diagonal matrix is PSD if and only if its diagonal entries are non-negative.

As for LS and QP, showing inclusion is about as difficult as showing the inclusion of SOCP. Since SOCP was shown to subsume LS and QP, we will focus on proving the inclusion of SOCP.


Inclusion of SOCP We will demonstrate that SOCP can be cast into the form of SDP. So let us start with the standard form of SOCP:

$$\min_{x \in \mathbb{R}^d} w^T x: \quad \|A_i x - b_i\| \le c_i^T x + e_i, \; i \in \{1, \dots, m\}, \quad Fx = g \qquad (1.198)$$

where $w \in \mathbb{R}^d$, $A_i \in \mathbb{R}^{k_i \times d}$, $b_i \in \mathbb{R}^{k_i}$, $c_i \in \mathbb{R}^d$, $e_i \in \mathbb{R}$, $F \in \mathbb{R}^{p \times d}$, and $g \in \mathbb{R}^p$. Manipulating the SOC constraint in (1.198), we get:

$$(c_i^T x + e_i)^2 \ge \|A_i x - b_i\|^2. \qquad (1.199)$$

Notice that $c_i^T x + e_i$ is non-negative for a feasible $x$. This is due to the inequality constraint in (1.198); otherwise $x$ is not feasible. Also, without loss of generality, we can assume that $c_i^T x + e_i > 0$. Otherwise, the constraint becomes $\|A_i x - b_i\| = 0$. Then, it becomes an equality constraint, so it can be merged into $Fx = g$. Hence, one can divide both sides by $c_i^T x + e_i$ to get:

$$c_i^T x + e_i \ge \frac{\|A_i x - b_i\|^2}{c_i^T x + e_i}. \qquad (1.200)$$

An alternative yet insightful expression of (1.200) reads:

$$(c_i^T x + e_i) - (A_i x - b_i)^T \{(c_i^T x + e_i) I_{k_i \times k_i}\}^{-1} (A_i x - b_i) \ge 0 \qquad (1.201)$$

where $I_{k_i \times k_i}$ denotes the $k_i$-by-$k_i$ identity matrix. Here a key observation is that the left-hand side of (1.201) is the well-known Schur complement of the block $(c_i^T x + e_i) I_{k_i \times k_i}$. See below for the definition of the Schur complement.


Definition: (Schur complement) Suppose that $p$ and $q$ are non-negative integers, and $A \in \mathbb{R}^{p \times p}$, $B \in \mathbb{R}^{p \times q}$, $C \in \mathbb{R}^{q \times q}$. Let

$$X = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \in \mathbb{R}^{(p+q) \times (p+q)}. \qquad (1.202)$$

If $A$ is invertible, then the Schur complement $S$ of the block $A$ in the matrix $X$ is defined as:

$$S := C - B^T A^{-1} B. \qquad (1.203)$$

In (1.201), we see the following mapping: $A = (c_i^T x + e_i) I_{k_i \times k_i}$, $B = A_i x - b_i$ and $C = c_i^T x + e_i$.

In fact, there is a well-known lemma concerning the Schur complement, which plays a key role in translating (1.201) into the standard form of the inequality constraints that appear in SDP. It is the Schur complement lemma, formally stated below.

Schur complement lemma: Suppose $A \in \mathbb{R}^{p \times p}$ is positive definite, i.e., $v^T A v > 0$ for all nonzero $v \in \mathbb{R}^p$. This is simply denoted by $A \succ 0$. Also suppose $C$ is symmetric. Then,

$$X = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \succeq 0 \iff S := C - B^T A^{-1} B \succeq 0. \qquad (1.204)$$

Proof: We will present it at the end of this section. $\blacksquare$

From the mapping $A = (c_i^T x + e_i) I_{k_i \times k_i}$, $B = A_i x - b_i$ and $C = c_i^T x + e_i$, we can write down the SOC constraint (1.201) as an LMI of our desired form:

$$F_i(x) := \begin{bmatrix} (c_i^T x + e_i) I_{k_i \times k_i} & A_i x - b_i \\ (A_i x - b_i)^T & c_i^T x + e_i \end{bmatrix} \succeq 0. \qquad (1.205)$$

Note that the matrix $F_i(x) \in \mathbb{R}^{(k_i+1) \times (k_i+1)}$ is symmetric and its entries are affine in $x$. The question of interest is then: Is $F_i(x) \succeq 0$ an LMI? It turns out the answer is yes. In fact, one can readily prove it by showing that all the matrices associated with the $x_i$'s in (1.205) are symmetric. See below for the proof.

Proof: $F(x) := \begin{bmatrix} (c^T x + e) I_{k \times k} & Ax - b \\ (Ax - b)^T & c^T x + e \end{bmatrix} \succeq 0$ is an LMI For notational simplicity, we drop the subscript index $i$ everywhere. Also we use a different notation, say $k$, to indicate the row dimension of the matrix $A$. We first write the matrix $F(x)$ as the sum of two matrices:

$$\begin{bmatrix} (c^T x + e) I_{k \times k} & Ax - b \\ (Ax - b)^T & c^T x + e \end{bmatrix} = \begin{bmatrix} e I_{k \times k} & -b \\ -b^T & e \end{bmatrix} + \begin{bmatrix} c^T x I_{k \times k} & Ax \\ x^T A^T & c^T x \end{bmatrix}. \qquad (1.206)$$

Here the first matrix in the RHS is symmetric. Let $A = [a_1 \; a_2 \; \cdots \; a_d] \in \mathbb{R}^{k \times d}$ and $c = [c_1, c_2, \dots, c_d]^T \in \mathbb{R}^d$ where $a_i \in \mathbb{R}^k$ and $c_i \in \mathbb{R}$. Using these notations, one can represent the second matrix in the RHS as:

$$\begin{bmatrix} c^T x I_{k \times k} & Ax \\ x^T A^T & c^T x \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^d c_i x_i I_{k \times k} & \sum_{i=1}^d a_i x_i \\ \sum_{i=1}^d a_i^T x_i & \sum_{i=1}^d c_i x_i \end{bmatrix} = \sum_{i=1}^d x_i \begin{bmatrix} c_i I_{k \times k} & a_i \\ a_i^T & c_i \end{bmatrix}.$$

Notice that the matrices $\begin{bmatrix} c_i I_{k \times k} & a_i \\ a_i^T & c_i \end{bmatrix}$ are symmetric for all $i$. This together with the fact that the first matrix in the RHS of (1.206) is symmetric shows that $F(x) \succeq 0$ is an LMI, thus completing the proof.
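The link between the SOC constraint and the LMI (1.205) can also be observed numerically: the smallest eigenvalue of $F(x)$ equals $(c^T x + e) - \|Ax - b\|$, so $F(x) \succeq 0$ exactly when the SOC constraint holds. The sketch below uses hypothetical data `A`, `b`, `c`, `e` chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data for a single SOC constraint ||A x - b|| <= c^T x + e.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.5, -0.5, 0.0])
c = np.array([0.5, -0.2])
e = 2.0
k = A.shape[0]

def F(x):
    """Build the (k+1) x (k+1) matrix in (1.205) for one constraint."""
    t = c @ x + e
    r = A @ x - b
    return np.block([[t * np.eye(k), r[:, None]],
                     [r[None, :], np.array([[t]])]])

def soc_gap(x):
    """(c^T x + e) - ||A x - b||; non-negative iff the SOC constraint holds."""
    return (c @ x + e) - np.linalg.norm(A @ x - b)

points = 2.0 * rng.normal(size=(500, 2))
min_eigs = np.array([np.linalg.eigvalsh(F(x)).min() for x in points])
gaps = np.array([soc_gap(x) for x in points])

print("min eigenvalue equals SOC gap:", np.allclose(min_eigs, gaps))
print("feasible points:", int(np.sum(gaps >= 0)), "of", len(points))
```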


SOCP → SDP We have figured out that $F_i(x) \succeq 0$ is an LMI for all $i$. So we have multiple LMIs, while we need a single LMI to meet the standard form of SDP. Here the good news is that such multiple LMIs can be merged into a single LMI. The trick is the following:

$$F_1(x), \dots, F_m(x) \succeq 0 \iff F(x) := \begin{bmatrix} F_1(x) & 0 & \cdots & 0 \\ 0 & F_2(x) & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & F_m(x) \end{bmatrix} \succeq 0. \qquad (1.207)$$

The proof of (1.207) is straightforward. It requires only the definition of PSD; please work it out as an exercise. Using this, one can rewrite the problem (1.198) as:

$$\min_{x \in \mathbb{R}^d} w^T x: \quad F(x) \succeq 0, \quad Fx = g. \qquad (1.208)$$

Hence, we show that SOCP can be translated into SDP, thus proving the inclusion of SOCP.

Recall "Schur complement lemma" Suppose $A \succ 0$ and $C$ is symmetric. Then,

$$X := \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \succeq 0 \iff S := C - B^T A^{-1} B \succeq 0. \qquad (1.209)$$


The proof requires the implication of two directions. Let us do it one by one. 


Proof of the forward direction Notice that the starting point is $X \succeq 0$. This together with the definition of PSD motivates us to consider the following function:

$$f(x, y) := \begin{bmatrix} x^T & y^T \end{bmatrix} \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \qquad (1.210)$$

where $x \in \mathbb{R}^p$ and $y \in \mathbb{R}^q$. A key observation w.r.t. the function $f(x, y)$ allows us to see the relationship with the Schur complement $S$, thereby proving the forward direction. To figure out what it means, let us first rewrite (1.210) as:

$$f(x, y) = x^T A x + 2 x^T B y + y^T C y. \qquad (1.211)$$

Here the key observation is that $f(x, y)$ is strictly convex in $x$. Why? Because of $A \succ 0$ (from the given hypothesis) as well as the 2nd-order condition of convexity. Hence, $f(x, y)$ has a minimum over $x$, and it is achieved at the point that satisfies $\nabla_x f(x^*, y) = 2 A x^* + 2 B y = 0$. A straightforward calculation then gives: $x^* = -A^{-1} B y$.

In an effort to relate this key observation to the Schur complement $S$, let us define the minimum as:

$$g(y) := \min_x f(x, y).$$
Plugging $x^* = -A^{-1} B y$ into the above, we can obtain:

$$\begin{aligned}
g(y) &= f(x^*, y) \\
&= (A^{-1} B y)^T A (A^{-1} B y) - 2 (A^{-1} B y)^T B y + y^T C y \\
&= y^T (C - B^T A^{-1} B) y \\
&= y^T S y
\end{aligned} \qquad (1.212)$$

where the second last step comes from $(A^{-1})^T = A^{-1}$ ($A$ is symmetric and invertible).

This together with (1.210) enables us to easily prove the forward direction. Suppose that the matrix $X$ is PSD. Then, the formula (1.210) implies that $f(x, y) \ge 0$ for all $x, y$. Hence, for a particular $x$, say $x^*$, the function is also non-negative: $f(x^*, y) \ge 0$ $\forall y$. This together with $g(y) = f(x^*, y) = y^T S y$ (see (1.212)) yields: $y^T S y \ge 0$ for all $y$. This implies $S \succeq 0$, thus completing the proof.


Proof of the reverse direction (converse) Given the above formulas (1.210) and (1.212), the converse proof is also straightforward. Suppose $S \succeq 0$. Then, by the definition of PSD, $0 \le g(y)$ for all $y$ (see (1.212)). This together with the definition $g(y) := \min_x f(x, y)$ then gives:

$$0 \le g(y) \le f(x, y) \quad \forall x, y. \qquad (1.213)$$

This implies $X \succeq 0$ (see (1.210)), thus completing the proof.
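Both directions of the lemma can be illustrated with a small experiment. In the sketch below (random data, chosen only for illustration), $C$ is constructed so that the Schur complement equals a multiple `shift` of the identity; the block matrix $X$ should then be PSD exactly when `shift` is non-negative. The last lines check that $f(x^*, y) = y^T S y$ at $x^* = -A^{-1} B y$.

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 3, 2

def is_psd(M, tol=1e-9):
    """PSD test through the smallest eigenvalue of a symmetric matrix."""
    return bool(np.linalg.eigvalsh(M).min() >= -tol)

M = rng.normal(size=(p, p))
A = M @ M.T + np.eye(p)                   # positive definite by construction
B = rng.normal(size=(p, q))
BtAinvB = B.T @ np.linalg.solve(A, B)
BtAinvB = 0.5 * (BtAinvB + BtAinvB.T)     # remove rounding asymmetry

results = []
for shift in (-2.0, -0.5, 0.5, 2.0):
    C = BtAinvB + shift * np.eye(q)       # forces S = shift * I
    X = np.block([[A, B], [B.T, C]])
    S = C - BtAinvB
    results.append((shift, is_psd(X), is_psd(S)))
    print(f"shift={shift:+.1f}  X PSD: {is_psd(X)}  S PSD: {is_psd(S)}")

# g(y) = min_x f(x, y) is attained at x* = -A^{-1} B y and equals y^T S y
# (checked here for the last X and S built in the loop).
y = rng.normal(size=q)
x_star = -np.linalg.solve(A, B @ y)
v = np.concatenate([x_star, y])
g_val = v @ X @ v
print("f(x*, y) == y^T S y:", np.isclose(g_val, y @ S @ y))
```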


Look ahead One natural question is: Why do we care about SDP? One obvious 
reason is that SDP has many applications. In particular, SDP plays a crucial role in 
approximating difficult non-convex optimization problems via a famous technique, 
called SDP relaxation. In the next section, we will explore SDP relaxation in depth. 
1.14 SDP Relaxation


Recap In the previous section, we have studied what SDP is, and showed in par- 
ticular that it subsumes SOCP using a variety of interesting techniques. One key 
lemma that we put a particular emphasis on was: Schur complement lemma. We 
also proved the lemma. At the end, we claimed that SDP has many interesting 
applications. 


Outline In this section, we are going to discuss such applications. Specifically, we will do four things. First we will discuss two specific settings in which SDP plays a powerful role. We will then focus on one particular technique that is very instrumental in addressing many difficult non-convex problems. The technique is SDP relaxation. Next, we will study SDP relaxation in depth in the context of one particular problem, called the MAXCUT problem (Karp, 1972). Lastly we will present a CVXPY implementation for SDP.


Two settings of SDP's interest The first is the setting which concerns very difficult non-convex problems. As mentioned earlier, SDP relaxation plays a role in tackling such difficult problems. In fact, it serves to approximate them.

The second is the problem context in which the maximum eigenvalue of a matrix or the nuclear norm is the quantity that we wish to minimize. Here the nuclear norm, denoted by $\|A\|_*$, is defined as: $\|A\|_* := \sum_i \sigma_i(A)$ where $\sigma_i(A)$ indicates the $i$th singular value of $A$. One of the recent popular applications where such problems arise is matrix completion, in which one wishes to identify missing entries of a matrix only from partially revealed entries (Candès and Recht, 2009).

Here we will discuss only one thing in depth: SDP relaxation. We will study SDP relaxation in the context of one specific yet famous problem: the MAXCUT problem.

A side note: For those who forgot about the concept of singular values, let us leave some details. For a matrix $A \in \mathbb{R}^{m \times n}$, a non-negative real number $\sigma$ is a singular value of $A$ if and only if there exist unit-length vectors $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^n$ such that $Av = \sigma u$ and $A^T u = \sigma v$. A way to find the singular values and the corresponding vectors is as follows. Let $r = \min(m, n)$. We consider $A^T A \in \mathbb{R}^{n \times n}$. Since $A^T A \succeq 0$, we can obtain the eigenvalue decomposition as:

$$A^T A = V \Sigma^2 V^T \qquad (1.214)$$

where $V \in \mathbb{R}^{n \times n}$ is a unitary matrix and $\Sigma := \text{diag}(\sigma_1, \dots, \sigma_n) \in \mathbb{R}^{n \times n}$. For $A A^T \in \mathbb{R}^{m \times m}$, we can do the same thing to get:

$$A A^T = U \Sigma^2 U^T \qquad (1.215)$$

where $U \in \mathbb{R}^{m \times m}$ is a unitary matrix.
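The relationship between singular values and the eigenvalues of $A^T A$ can be confirmed with NumPy. The sketch below uses a random $5 \times 3$ matrix (an arbitrary choice) and also evaluates the nuclear norm as the sum of the singular values.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 3))

# Singular values (returned in descending order).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigenvalues of A^T A (ascending from eigvalsh, so reverse them).
eig_AtA = np.linalg.eigvalsh(A.T @ A)[::-1]
print("sigma^2 are the eigenvalues of A^T A:", np.allclose(s**2, eig_AtA))

# Defining relations A v = sigma u and A^T u = sigma v for the top pair.
u1, v1 = U[:, 0], Vt[0]
print(np.allclose(A @ v1, s[0] * u1), np.allclose(A.T @ u1, s[0] * v1))

# Nuclear norm = sum of singular values.
nuclear = s.sum()
print("nuclear norm:", nuclear, np.isclose(nuclear, np.linalg.norm(A, "nuc")))
```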
Figure 1.33. MAXCUT problem: Finding a set that maximizes a cut. In this example, the set $S = \{1, 3, 5\}$ (with complement $S^c = \{2, 4, 6\}$) and the cut w.r.t. the set $S$ is $w_{54} + w_{14} + w_{36} + w_{12} + w_{32}$.


MAXCUT problem (Karp, 1972) The goal of the problem is to find a set that maximizes a cut. To understand what it means, we need to know about three concepts: (i) set; (ii) weight; and (iii) cut. The context in which the problem is defined is a graph $G$ which consists of a vertex set $V$ and an edge set $E$. The problem concerns an undirected graph in which each edge does not have a direction, meaning that one edge in $E$, say $(1, 2)$, is the same as its counterpart $(2, 1)$. For the example in Fig. 1.33, the edge set reads:

$$E = \{(1, 2), (1, 4), (1, 5), (2, 3), (2, 4), (3, 6), (4, 5)\}.$$

Here a set $S$ means a subset of the vertex set $V$. For example, in Fig. 1.33, the set is $S = \{1, 3, 5\} \subset V$. The weight is a real value that is associated with an edge. It is denoted by $w_{ij}$ for an edge $(i, j) \in E$. The cut is defined as the sum of the weights of all edges that cross between the set $S$ and its complement $S^c$. In the example of Fig. 1.33, the crossing edges are: $\{(5, 4), (1, 4), (3, 6), (1, 2), (3, 2)\}$. Hence the cut w.r.t. the set $S$ is:

$$\text{Cut}(S) = w_{54} + w_{14} + w_{36} + w_{12} + w_{32}. \qquad (1.216)$$


Optimization for MAXCUT To formulate an optimization problem for MAXCUT, we first need to come up with a proper optimization variable. Obviously the optimization variable should serve to make a choice of a set $S$. Hence, we consider the following variable $x_i$ such that it indicates whether node $i$ is in the set $S$:

$$x_i = \begin{cases} +1, & i \in S; \\ -1, & \text{otherwise}. \end{cases} \qquad (1.217)$$

Here a key observation is that when $x_i \ne x_j$, the edge $(i, j)$ (if it exists) crosses the two sets $S$ and $S^c$, and hence, this should contribute to $\text{Cut}(S)$ by the amount of $w_{ij}$. For $(i, j) \notin E$, we set $w_{ij} = 0$. On the other hand, when $x_i = x_j$ (meaning the edge $(i, j)$ does not cross the sets), there should be no contribution to the cut. This motivates us to formulate the following optimization:

$$\max_{x_i} \sum_{i=1}^{d} \sum_{j=1}^{d} \frac{1}{2} w_{ij} (1 - x_i x_j): \quad x_i^2 = 1, \; i \in \{1, \dots, d\} \qquad (1.218)$$

where $d$ indicates the size of the vertex set. Notice in the objective function that we get $w_{ij}$ whenever $x_i \ne x_j$; 0 otherwise. The constraint $x_i^2 = 1$ respects the fact that $x_i$ is either $+1$ or $-1$.
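To see the objective at work, the sketch below evaluates $\sum_{i,j} \frac{1}{2} w_{ij}(1 - x_i x_j)$ for the set $S = \{1, 3, 5\}$ on the edge set of Fig. 1.33. Two assumptions are made purely for illustration: all weights are set to 1 (the text does not specify them here), and each edge weight is stored once (only for $i < j$), so that the double sum counts every edge a single time. A brute-force search over all $2^6$ assignments is included for comparison.

```python
import numpy as np
from itertools import product

d = 6
edges = [(1, 2), (1, 4), (1, 5), (2, 3), (2, 4), (3, 6), (4, 5)]

# Hypothetical unit weights, stored once per edge (entry (i, j) with i < j).
W = np.zeros((d, d))
for i, j in edges:
    W[i - 1, j - 1] = 1.0

def objective(x):
    """Sum over i, j of (1/2) * w_ij * (1 - x_i x_j)."""
    return 0.5 * np.sum(W * (1 - np.outer(x, x)))

# Indicator vector for S = {1, 3, 5}: +1 inside S, -1 outside.
S = {1, 3, 5}
x = np.array([1.0 if i in S else -1.0 for i in range(1, d + 1)])

# Direct evaluation of the cut: add weights of edges with exactly one end in S.
cut_direct = sum(W[i - 1, j - 1] for i, j in edges if (i in S) != (j in S))
print("objective:", objective(x), "direct cut:", cut_direct)   # both 5.0

# Exhaustive search over all +/-1 assignments (feasible only for tiny graphs).
best = max(objective(np.array(z)) for z in product([-1.0, 1.0], repeat=d))
print("maximum cut with unit weights:", best)                  # 6.0
```

The exhaustive search grows as $2^d$, which is why a tractable approximation is needed for graphs of realistic size.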


A translation technique: Lifting Observe in the objective function of (1.218) that we have quadratic terms like $x_i x_j$. Also we have a quadratic equality constraint. So these do not match the standard form of any convex instance that we have studied thus far.

In an effort to translate such undesirable terms into favourable terms (e.g., affine terms), we introduce a well-known technique, called lifting. Here lifting means raising the dimension of the space that the optimization variable lives in. In the considered example, the optimization variables $x_i$'s can be represented as a vector: $x := [x_1, \dots, x_d]^T$. So the lifting in this context is to convert the vector $x$ into a higher dimensional entity, say a matrix. For instance, one may introduce a new matrix, say $X$, such that its $(i, j)$-entry $[X]_{ij}$ is defined as:

$$X_{ij} = x_i x_j. \qquad (1.219)$$


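We can then represent $X$ in a very succinct way:

$$X = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1d} \\ X_{21} & X_{22} & \cdots & X_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ X_{d1} & X_{d2} & \cdots & X_{dd} \end{bmatrix} = \begin{bmatrix} x_1 x_1 & x_1 x_2 & \cdots & x_1 x_d \\ x_2 x_1 & x_2 x_2 & \cdots & x_2 x_d \\ \vdots & \vdots & \ddots & \vdots \\ x_d x_1 & x_d x_2 & \cdots & x_d x_d \end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \begin{bmatrix} x_1 & x_2 & \cdots & x_d \end{bmatrix} = x x^T. \qquad (1.220)$$

In fact, there is one thing that we need to worry about whenever doing a change of variables: the constraint that is newly imposed by the introduction of a new variable. Here a new constraint kicks in. To figure this out, notice that the matrix $X$ is of rank 1. Why? Remember the eigenvalue decomposition for a symmetric matrix: $X = U \Lambda U^T$. From $X = x x^T$, we see that $X$ has only one eigenvector with a non-zero eigenvalue, namely $x$. Its eigenvalue reads $x^T x$, since

$$X x = (x x^T) x = x (x^T x) = (x^T x) x. \qquad (1.221)$$

Obviously the eigenvalue $x^T x$ is non-negative. So the change of variable induces the following constraints:

$$X_{ii} = 1, \quad X \succeq 0, \quad \text{rank}(X) = 1. \qquad (1.222)$$

Actually the above also implies that $x_i^2 = 1$ and $X = x x^T$. Why? Think about it. Hence, with a new matrix variable $X$, the problem (1.218) can be rewritten as:

$$\begin{aligned}
p^* := \max_{X} \; & \sum_{i=1}^{d} \sum_{j=1}^{d} \frac{1}{2} w_{ij} (1 - X_{ij}): \\
& X_{ii} = 1, \; i \in \{1, \dots, d\} \quad \text{(affine)} \\
& X \succeq 0 \quad \text{(LMI)} \\
& \text{rank}(X) = 1 \quad \text{(rank constraint)}.
\end{aligned} \qquad (1.223)$$

Notice that $X \succeq 0$ is an LMI. Why? For example, consider the $d = 2$ case in which

$$X = \begin{bmatrix} X_{11} & X_{12} \\ X_{12} & X_{22} \end{bmatrix} = X_{11} \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} + X_{12} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} + X_{22} \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}.$$

It is affine in the $X_{ij}$'s and the associated matrices are all symmetric.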
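The three properties induced by lifting (unit diagonal, positive semi-definiteness and rank one) are easy to confirm on a concrete vector. The sketch below uses the indicator vector of $S = \{1, 3, 5\}$ as an example input.

```python
import numpy as np

# Indicator vector of S = {1, 3, 5} among 6 nodes.
x = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])

# Lifted variable X = x x^T, i.e. X_ij = x_i x_j.
X = np.outer(x, x)

eigs = np.linalg.eigvalsh(X)              # ascending order
print("unit diagonal:", np.all(np.diag(X) == 1.0))
print("rank:", np.linalg.matrix_rank(X))
print("eigenvalues:", np.round(eigs, 10))  # one eigenvalue x^T x = 6, rest 0

# x is an eigenvector of X with eigenvalue x^T x, as in (1.221).
print("X x = (x^T x) x:", np.allclose(X @ x, (x @ x) * x))
```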


SDP relaxation Notice in (1.223) that the objective function is affine in the $X_{ij}$'s, the first equality constraint is affine, and the second inequality constraint is an LMI. However, it contains an undesirable constraint: $\text{rank}(X) = 1$ (rank constraint). So it is not an SDP.

This is where a technique, called SDP relaxation, comes in. The idea of SDP relaxation is simply to ignore the rank constraint. By ignoring the constraint, the search space of the optimization problem is enlarged, and hence it is indeed a relaxation. Applying the technique, we get:

$$\begin{aligned}
p^*_{\text{SDP}} := \max_{X} \; & \sum_{i=1}^{d} \sum_{j=1}^{d} \frac{1}{2} w_{ij} (1 - X_{ij}): \\
& X_{ii} = 1, \; i \in \{1, \dots, d\} \quad \text{(affine)} \\
& X \succeq 0 \quad \text{(LMI)}.
\end{aligned} \qquad (1.224)$$

Obviously $p^*_{\text{SDP}} \ge p^*$, since it is a relaxation of the maximization problem.
Interestingly, in many cases, the gap between $p^*_{\text{SDP}}$ and $p^*$ is not so large. In some cases, the gap can be large. But it was shown by Nesterov (a Russian mathematician who has played an important role in the convex optimization field) that the gap is not arbitrarily large (Nesterov, 1998). The worst-case bound was shown to be:


The proofs of these are out of the scope of this book. 


How to convert $X^*_{\text{SDP}}$ into a vector of interest? You may wonder how to convert $X^*_{\text{SDP}}$ (obtained from (1.224)) into an associated vector that serves to find a set. This is needed because $X^*_{\text{SDP}}$ may not be of the desired form: $X^*_{\text{SDP}} = x x^T$. In such an undesirable yet frequently occurring case, one way to go is to apply a well-known technique in statistics: Principal Component Analysis (PCA) (Pearson, 1901). The way it works is as follows. We first do the eigenvalue decomposition to get:

$$X^*_{\text{SDP}} = U \text{diag}(\lambda_1, \lambda_2, \dots, \lambda_d) U^T \qquad (1.225)$$

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$. Here the number of non-zero eigenvalues determines the rank of the matrix. So the idea is to take only the first (principal) largest eigenvalue $\lambda_1$ while ignoring the others, approximating the matrix as:

$$\hat{X}^*_{\text{SDP}} := U \text{diag}(\lambda_1, 0, \dots, 0) U^T. \qquad (1.226)$$

This way, we can ensure $\text{rank}(\hat{X}^*_{\text{SDP}}) = 1$, enabling us to obtain the vector of interest.

A note: Notice that the solution $\hat{X}^*_{\text{SDP}}$ is an approximate one obtained from a heuristic. So it need not respect the constraints in the optimization problem (1.224). As you may easily imagine, the approximate solution $\hat{X}^*_{\text{SDP}}$ can readily violate a constraint like $X_{ii} = 1$.
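The rank-one approximation (1.226) can be sketched in a few lines of NumPy. The matrix `X_sdp` below is a hypothetical stand-in for a relaxed solution (PSD, unit diagonal, rank 2), built by hand rather than returned by a solver. The final sign-rounding step that produces a $\pm 1$ vector is an assumption added for illustration; the text itself only specifies the truncation to the principal eigenvalue.

```python
import numpy as np

# Hypothetical relaxed solution: a mixture of two lifted +/-1 vectors.
x1 = np.array([1.0, -1.0, 1.0, -1.0])
x2 = np.array([1.0, 1.0, -1.0, -1.0])
X_sdp = 0.7 * np.outer(x1, x1) + 0.3 * np.outer(x2, x2)   # PSD, X_ii = 1, rank 2

# Eigenvalue decomposition (eigh returns eigenvalues in ascending order).
lam, U = np.linalg.eigh(X_sdp)
lam1, u1 = lam[-1], U[:, -1]

# Keep only the principal component, as in (1.226).
X_hat = lam1 * np.outer(u1, u1)
print("rank of X_hat:", np.linalg.matrix_rank(X_hat))
print("diagonal of X_hat:", np.round(np.diag(X_hat), 6))   # no longer all ones

# Assumed rounding heuristic: take the signs of the principal eigenvector.
x_round = np.sign(u1)
print("rounded vector (up to a global sign):", x_round)
```

In this example the diagonal of `X_hat` equals 0.7 rather than 1, which illustrates the constraint violation mentioned in the note above.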


How to solve general SDP? Like QP and SOCP, there is no closed-form solu- 
tion for SDP in general. So as mentioned earlier, we should rely on strong duality 
and KKT conditions to gain algorithmic insights. Hence, later in Part II, we will 
study the content in depth. 


CVXPY implementation Lastly we investigate how to write a CVXPY script for solving SDP. For ease of implementation, we often rely upon another standard form that incorporates the trace operation and a PSD matrix:

$$\min_{X \in \mathbb{R}^{n \times n}} \text{trace}(WX): \quad X \succeq 0, \quad \text{trace}(A_i X) = b_i, \; i \in \{1, \dots, p\} \qquad (1.227)$$

where $W, A_i \in \mathbb{R}^{n \times n}$ are symmetric matrices and $b_i \in \mathbb{R}$. Here the trace operation is defined for a square matrix, say $A \in \mathbb{R}^{n \times n}$:

$$\text{trace}(A) := \sum_{i=1}^{n} A_{ii} \qquad (1.228)$$

where $A_{ii}$ indicates the $i$th diagonal entry of $A$. Actually one can represent the objective $\text{trace}(WX)$ as a linear combination of the optimization variables $X_{ij}$'s (the entries of $X$). Similarly for $\text{trace}(A_i X)$. Check in Prob 4.9(f). As shown earlier, $X \succeq 0$ is an LMI.

Figure 1.34. Hierarchy of convex optimization problems: least-squares (1800s), linear program (LP, 1939), quadratic program (QP, 1956), second-order cone program (SOCP, 1994) and semi-definite program (SDP, 1994). Very few applications lie in the class of convex problems beyond SDP.

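For code implementation, we consider the following example in which $n = 3$, $p = 1$ and the corresponding matrices and parameters read:

$$W = \begin{bmatrix} 34 & -2 & -6 \\ -2 & 18 & 15 \\ -6 & 15 & 30 \end{bmatrix}, \quad A_1 = \begin{bmatrix} 19 & 2 & 0 \\ 2 & 6 & 1 \\ 0 & 1 & 22 \end{bmatrix}, \quad b_1 = 10.$$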


A code is given by:

import cvxpy as cp
import numpy as np

# optimization variable: a symmetric 3-by-3 matrix
X = cp.Variable((3, 3), symmetric=True)

# construct (W, A_1, b_1)
W = np.array([[34, -2, -6],
              [-2, 18, 15],
              [-6, 15, 30]])
A1 = np.array([[19, 2, 0],
               [2, 6, 1],
               [0, 1, 22]])
b1 = np.array([10])

# objective function
cost = cp.trace(W @ X)
obj_min = cp.Minimize(cost)

# constraints: X >> 0 requires X to be PSD
ineq_constraints = [X >> 0]
constraints = ineq_constraints + [cp.trace(A1 @ X) == b1]

# set up a problem
prob = cp.Problem(obj_min, constraints)

# solve the problem
prob.solve()

# print the solution
print('status: ', prob.status)
print('optimal value: ', prob.value)
print('optimal X*:', X.value)


status: optimal
optimal value: 6.795868612485915
optimal X*: [[ 0.00610944 -0.0485969 0.04867967]
 [-0.0485969 0.38656275 -0.38722108]
 [ 0.04867967 -0.38722108 0.38788063]]
Here we use cp.trace() for the trace operation, and the operator ">>" to implement the matrix inequality.


Look ahead So far, we have studied five instances of convex optimization: LP, LS, QP, SOCP and SDP. In fact, there are more instances which are convex but do not belong to the prior classes. However, such instances have very few applications. Hence, we stop here. Instead we will focus on studying algorithms for general QP, SOCP and SDP, which we have deferred. So we will embark on Part II to start investigating strong duality.
Problem Set 4


Prob 4.1 (Invertibility of the KKT matrix) Consider the equality-constrained least-squares problem:

$$\min_x \|Ax - b\|^2: \quad Cx - e = 0 \qquad (1.229)$$

where $A \in \mathbb{R}^{m \times d}$ and $C \in \mathbb{R}^{p \times d}$. Assume that $m \ge d$ and $p \le d$. In Section 1.11, we showed that if

$$\text{rank}(C) = p \quad \text{and} \quad \text{rank}\begin{bmatrix} A \\ C \end{bmatrix} = d, \qquad (1.230)$$

then the KKT matrix

$$\begin{bmatrix} 2A^T A & C^T \\ C & 0 \end{bmatrix}$$

is invertible. It turns out the other way around also holds: if the KKT matrix is invertible, then (1.230) holds. Prove the converse.


Prob 4.2 (KKT equations) In this problem, you are guided through a direct derivation of the formula (1.166), which we omitted in Section 1.11. For ease of illustration, let us repeat the formula:

$$x^* = d\text{-Components}\left(\begin{bmatrix} 2A^T A & C^T \\ C & 0 \end{bmatrix}^{-1} \begin{bmatrix} 2A^T b \\ e \end{bmatrix}\right) \qquad (1.231)$$

where $d\text{-Components}(\cdot)$ is an operator that takes the first $d$ components of $(\cdot)$, and $(A, C, b, e)$ are the matrices and vectors associated with the constrained least-squares problem (1.229) in Prob 4.1.

The direct derivation is based on an optimality condition that we have not derived yet, but which can be understood in Part II. In this problem, we will proceed by simply adopting the optimality condition:

$$\nabla f(x^*)^T (x - x^*) \ge 0, \quad \forall x: Cx = e. \qquad (1.232)$$

(a) Let $v = x - x^*$. Show that the condition (1.232) is equivalent to:

$$\nabla f(x^*)^T v \ge 0, \quad \forall v: Cv = 0. \qquad (1.233)$$

(b) For notational simplicity, let us use $w$ instead of $\nabla f(x^*)$. Then, the condition (1.233) reads:

$$w^T v \ge 0 \quad \forall v: Cv = 0. \qquad (1.234)$$

Suppose that $d = 2$, $p = 1$ and $C = [1 \;\; -1]$. Prove Fact 1 below:

$$\text{Fact 1:} \quad (1.234) \implies w^T v = 0 \quad \forall v: Cv = 0. \qquad (1.235)$$

Hint: You may want to use proof by contradiction: if $w^T v \ne 0$, then there exists $v$ such that $w^T v < 0$ and $Cv = 0$.

(c) Suppose that $d = 4$, $p = 1$ and $C = [1 \;\; -1 \;\; 0 \;\; 0]$. Prove Fact 2 below:

$$\text{Fact 2:} \quad (1.235) \implies w \in \text{range}(C^T). \qquad (1.236)$$

(d) Prove Fact 1 for arbitrary $(d, p, C)$.
(e) Prove Fact 2 for arbitrary $(d, p, C)$.
(f) Using the above proofs, show that there exists $z \in \mathbb{R}^p$ such that

$$2(A^T A) x^* - 2A^T b + C^T z = 0. \qquad (1.237)$$


Prob 4.3 (Gaussian distribution) Let $X$ and $Y$ be independent continuous random variables with probability density functions $f_X(\cdot)$ and $f_Y(\cdot)$ respectively. Let $Z = X + Y$.

(a) Express the probability density function (pdf) $f_Z(\cdot)$ in terms of $f_X(\cdot)$ and $f_Y(\cdot)$.

(b) Compute the two-sided Laplace transform of $f_Z(\cdot)$ defined as $F_Z(s) = \int_{-\infty}^{\infty} e^{-sa} f_Z(a) \, da$.

(c) Suppose that $X \sim \mathcal{N}(0, \sigma_X^2)$ and $Y \sim \mathcal{N}(0, \sigma_Y^2)$. Compute the Laplace transforms of $f_X(\cdot)$ and $f_Z(\cdot)$: $F_X(s)$ and $F_Z(s)$.

(d) Using the uniqueness theorem of the Laplace transform (i.e., the transform is a one-to-one mapping), argue that the inverse Laplace transform of $e^{\sigma^2 s^2 / 2}$ is the Gaussian density with mean 0 and variance $\sigma^2$.

(e) Consider $X$ and $Y$ in part (c). Let $W = aX + bY$ where $a, b \ne 0$. Show that $W$ is Gaussian.


Prob 4.4 (Basics) 


(a) Consider a function $f : \mathbb{R}^d \to \mathbb{R}$: $f(x) = \|x\|$. Show that the function $f$ is convex.

(b) Show that $F_1, F_2 \succeq 0$ if and only if

(c) Suppose that $F_1, F_2 \succeq 0$. Show that $\forall \lambda \in [0, 1]$,

$$\lambda F_1 + (1 - \lambda) F_2 \succeq 0.$$
Prob 4.5 (Linear Matrix Inequality)


(a) State the definition of a linear matrix inequality (LMI). 
(b) Let X € R?*? be symmetric. Show that X > 0 is an LMI. 
(c) Let x,c e R. Show that 


E 
iy ~ kA =0 


is an LMI. 
(d) Let A € R**? and x € R?. Show that 


0 Ax 
=0 
Ex s] = 
is an LMI. 


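(e) Let $A \in \mathbb{R}^{m \times d}$, $x, c \in \mathbb{R}^d$, $b \in \mathbb{R}^m$ and $e \in \mathbb{R}$. Show that

$$\begin{bmatrix} (c^T x + e) I & Ax - b \\ (Ax - b)^T & c^T x + e \end{bmatrix} \succeq 0$$

is an LMI.
(f) Represent the inequality

$$\|Ax - b\| \le y$$

with $y > 0$ and a variable $x$, as an LMI.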
Prob 4.6 (Robust LS) Consider an LS problem where $A$ has some uncertainty:

$$A = A_0 + A_1 \delta \qquad (1.238)$$

where $A_0, A_1 \in \mathbb{R}^{m \times d}$ ($m \ge d$) and $\delta$ is a random variable with zero mean and variance $\sigma^2$. The objective function is now a random variable, as it depends on $\delta$. We aim at minimizing the expectation of such a random function, which can be formulated as:

$$\min_x \mathbb{E}\left[\|Ax - b\|^2\right] \qquad (1.239)$$

where $b \in \mathbb{R}^m$ and the expectation is taken over $\delta$.

(a) Show that the problem is convex.
(b) To which class does it belong? Also cast it into the class.
Prob 4.7 (Schur complement lemma)


(a) Consider a set: 
6 x5 x3 
S= {xe R’ : xç > 0,x4 > =, x > : (1.240) 
x 5 


Is the set S convex? If so, prove it; otherwise disprove it. 


(b) Consider a function f : R° > R: 
2 
fo = —— (1.241) 
i) 


x2 — z2 
m- 


where x is defined on the set S in part (a). Is f (x) convex in x? If so, prove 


it; otherwise disprove it. 
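Prob 4.8 (Generalized Schur complement lemma) Let $A \in \mathbb{R}^{n \times n}$ and $C \in \mathbb{R}^{m \times m}$. Suppose $A \succeq 0$ and $C$ is symmetric. Suppose the eigenvalue decomposition of $A$ reads:

$$A = U \Sigma U^T \qquad (1.242)$$

where $U \in \mathbb{R}^{n \times n}$ is a unitary matrix, i.e., $U^T U = I$, and $\Sigma := \text{diag}(\lambda_1, \dots, \lambda_n)$. Define the pseudo-inverse of $A$ as: $A^{\dagger} := U \Sigma^{-1} U^T$ where $\Sigma^{-1}$ is a diagonal matrix whose elements respect:

$$[\Sigma^{-1}]_{ii} = \begin{cases} \lambda_i^{-1}, & \text{if } \lambda_i \ne 0; \\ 0, & \text{if } \lambda_i = 0. \end{cases} \qquad (1.243)$$

Prove that

$$X := \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \succeq 0 \iff S := C - B^T A^{\dagger} B \succeq 0, \text{ and } Bv \in \text{range}(A) \;\; \forall v \in \mathbb{R}^m. \qquad (1.244)$$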


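Prob 4.9 (Basics on traces) Consider a square matrix $A \in \mathbb{R}^{n \times n}$. Denote the trace of a square matrix by:

$$\text{trace}(A) := \sum_{i=1}^{n} A_{ii} \qquad (1.245)$$

where $A_{ii}$ indicates the $i$th diagonal entry of $A$.

(a) Suppose $A \succeq 0$. Show that $A$ can be represented as

$$A = X X^T \qquad (1.246)$$

for some $X \in \mathbb{R}^{n \times n}$.
(b) Show that $A A^T \succeq 0$.
(c) Consider $A, B \in \mathbb{R}^{n \times n}$. Show that for $A \succeq 0$ and $B \succeq 0$,

$$\text{trace}(AB) \ge 0. \qquad (1.247)$$

(d) Show that for $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times m}$,

$$\text{trace}(AB) = \text{trace}(BA). \qquad (1.248)$$

(e) Show that for $A \in \mathbb{R}^{n \times n}$,

$$\text{trace}(A) = \sum_{i=1}^{n} \lambda_i(A) \qquad (1.249)$$

where $\text{trace}(A) := \sum_{i=1}^{n} A_{ii}$ and $\lambda_i(A)$ indicates the $i$th eigenvalue of $A$.
(f) Let $X, Z \in \mathbb{R}^{n \times n}$ be symmetric matrices. Show that

$$\sum_{i=1}^{n} \sum_{j=1}^{n} Z_{ij} X_{ij} = \text{trace}(ZX). \qquad (1.250)$$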
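Prob 4.10 (A lemma) Suppose $X \in \mathbb{R}^{m \times n}$ and $t \in \mathbb{R}$. Show that

$$\|X\|_* \le t \iff \exists \, Y \in \mathbb{R}^{m \times m}, Z \in \mathbb{R}^{n \times n}: \begin{bmatrix} Y & X \\ X^T & Z \end{bmatrix} \succeq 0, \;\; \text{trace}(Y) + \text{trace}(Z) \le 2t. \qquad (1.251)$$

Here $\|X\|_*$ indicates the nuclear norm:

$$\|X\|_* = \sum_{i=1}^{\min\{m, n\}} \sigma_i(X) \qquad (1.252)$$

where $\sigma_i(X)$ denotes the $i$th singular value of $X$.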


Prob 4.11 (Matrix completion) Let $M \in \mathbb{R}^{m \times n}$ where $m \ge n$. Let $\Omega$ be the set of $(i, j)$'s such that the $(i, j)$ entries of $M$ are revealed (observed). Suppose that the revealed entries are $b_{ij}$'s:

$$M_{ij} = b_{ij}, \quad (i, j) \in \Omega. \qquad (1.253)$$

The problem of matrix completion is to reconstruct the non-revealed missing entries of $M$ from the revealed ones.

(a) Suppose that $m = n = 3$ and the revealed entries are:


1 2 
M=| x * (1.254) 
2 * 


* \O * 


where $(*)$ denotes a missing value. Suppose that $\text{rank}(M) = 1$, i.e., $M$ is of the form: $M = \alpha x x^T$ where $\alpha \in \mathbb{R}$, $x \in \mathbb{R}^3$. Is matrix completion possible? If so, perform matrix completion to find the missing entries of $M$. If not, explain why. What if $\text{rank}(M) = 2$?

(b) It has been shown in the literature that one can perform matrix completion by solving an optimization problem that minimizes the rank of the matrix $M$, as long as the number of revealed entries is sufficiently large. As a surrogate of the rank, one may often use the nuclear norm:

$$\|M\|_* = \sum_{i=1}^{n} \sigma_i(M) \qquad (1.255)$$

where $\sigma_i(M)$ denotes the $i$th singular value of $M$. So the nuclear-norm-based optimization problem reads:

$$\min_M \|M\|_*: \quad M_{ij} = b_{ij} \;\; (i, j) \in \Omega. \qquad (1.256)$$

Suppose $M \succeq 0$. Translate (1.256) into an SDP.
Hint: You may want to use a trace trick that you proved in Prob 4.9(e).
(c) Again consider the optimization problem (1.256). Now suppose $M$ is a general matrix, neither symmetric nor PSD. Show even in this case that (1.256) can be translated into an SDP.
Hint: You may want to use the lemma proved in Prob 4.10.


Prob 4.12 (SDP relaxation) Consider the MAXCUT problem that we studied in Section 1.14. See Fig. 1.35. The weights $w_{ij}$'s associated with edges are given as:

$$(w_{12}, w_{14}, w_{15}) = (1, 2, 6); \quad (w_{23}, w_{24}) = (2, 5); \quad w_{36} = 1; \quad w_{45} = 1.5.$$

[Figure: a six-node graph partitioned into $S = \{1,3,5\}$ and $S^c = \{2,4,6\}$.]

Figure 1.35. MAXCUT problem: Finding a set that maximizes a cut. In this example, the set $S = \{1,3,5\}$ and the cut w.r.t. the set $S$ is $w_{54} + w_{14} + w_{36} + w_{12} + w_{32}$.


Let $x_i$ denote whether node $i$ is in the set $S$:

$$x_i = \begin{cases} +1, & i \in S; \\ -1, & \text{otherwise.} \end{cases} \tag{1.257}$$

(a) Formulate an optimization problem that finds a set maximizing the cut. Derive the optimal value $p^*$ and the optimal solution $x^*$. You can do this by hand or by computer.

(b) Formulate an SDP relaxation problem. Solve this problem (derive $p^*_{\mathrm{SDP}}$) using CVXPY, and write the script for the CVXPY implementation. Is $p^*_{\mathrm{SDP}} = p^*$?


Prob 4.13 (SOCP and/or SDP) Consider an optimization problem:

$$\min_{x} \ w^T x : \quad a^T x \le b, \ \ \forall a \in \mathbb{R}^d : \|a - \bar{a}\|^2 \le t \tag{1.258}$$

where $\bar{a} \in \mathbb{R}^d$, $b \in \mathbb{R}$ and $t > 0$. Which class does this problem belong to among all the instances that you have learned so far? Also cast it into that class.


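Prob 4.14 (True or False?)
(a) Consider a Chance Program (CP):

$$\min_{x \in \mathbb{R}^d} \ w^T x : \quad \mathbb{P}(a^T x \le b) \ge 1 - \epsilon \tag{1.259}$$

where $w \in \mathbb{R}^d$, $\epsilon \in (0, 1)$ and $a$ is a random vector with mean $\mathbb{E}[a] = \bar{a}$ and covariance matrix $K = \mathbb{E}[(a - \bar{a})(a - \bar{a})^T]$. Consider a random variable: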


$$Y := \frac{a^T x - \bar{a}^T x}{\sqrt{x^T K x}}. \tag{1.260}$$

Suppose that the cumulative distribution function (CDF) of $Y$ and its inverse function are given, i.e., numerical values of the CDF evaluated at any point $Y = y$ (and also the inverse function values) are known. Then, the CP can always be cast into an SOCP.

(b) Let $A \in \mathbb{R}^{n \times n}$. Then, the eigenvalue decomposition of $A$ reads:

$$A = U \Lambda U^T \tag{1.261}$$

where $U \in \mathbb{R}^{n \times n}$ is a unitary matrix and $\Lambda := \mathrm{diag}(\lambda_1, \dots, \lambda_n)$.

(c) Consider the MAXCUT problem that we studied in Section 1.14. Using the lifting technique, we formulated the problem as:

$$p^* = \max_{X} \sum_{(i,j) \in E} \frac{1}{2} w_{ij} (1 - X_{ij}) : \quad X_{ii} = 1, \ i \in \{1, 2, \dots, d\}, \quad X \succeq 0, \quad \mathrm{rank}(X) = 1 \tag{1.262}$$

where $w_{ij}$ indicates a weight associated with $(i,j) \in E$; $E$ denotes the edge set of the graph given in the problem; and $d$ is the number of nodes in the graph. Let $p^*_{\mathrm{SDP}}$ be the optimal value of the approximated optimization due to SDP relaxation. Then, there exists a rank-1 matrix $X$ that achieves $p^*_{\mathrm{SDP}}$.

(d) Suppose that $F(x) \in \mathbb{R}^{n \times n}$ is symmetric and affine in $x \in \mathbb{R}^d$. Then, $F(x) \succeq 0$ is a linear matrix inequality.

(e) Consider the MAXCUT problem that we studied in Section 1.14. Using the lifting technique, we formulated the problem as:

$$p^* = \max_{X} \sum_{(i,j) \in E} \frac{1}{2} w_{ij} (1 - X_{ij}) : \quad X_{ii} = 1, \ i \in \{1, \dots, d\}, \quad X \succeq 0, \quad \mathrm{rank}(X) = 1 \tag{1.263}$$

where $w_{ij}$ indicates a weight associated with $(i,j) \in E$, $E$ denotes the edge set of the graph given in the problem, and $d$ is the number of nodes in the graph. Let $p^*_{\mathrm{SDP}}$ be the optimal value of the approximated optimization due to SDP relaxation. Since the search space in the relaxed optimization is bigger and the optimization is a maximization, $p^*_{\mathrm{SDP}} \ge p^*$.


Chapter 2


Duality 


2.1 Strong Duality 


Recap We have thus far studied several instances of convex optimization problems: LP, Least Squares, QP, SOCP and SDP. There is one more well-known instance that is convex but does not belong to the prior classes: the cone program (CP for short). Understanding CP requires many mathematical concepts, definitions and techniques, although there are few applications. Hence, we will not go further in this direction. Instead we will focus on what we have missed so far: generic algorithms that can be applied to general QP, SOCP and SDP. We deferred this content because algorithms for the general settings are based on strong duality and the KKT conditions, which belong to Part II. So from now on, we move on to Part II and start investigating them.


Outline In this section, we are going to cover four topics. Strong duality is based on the concepts of primal and dual problems, so we will first study what the primal and dual problems are. We will then study what strong duality means. Next, we will derive the KKT conditions and their intimate connection with strong duality. Finally we will see why they give insights into the design of algorithms. In the next section, we will study an algorithm inspired by them.

[Figure: nested classes of convex optimization: Least Squares (1800s), Linear Program (LP, 1939), Quadratic Program (QP, 1956), Second-Order Cone Program (SOCP, 1994) and Semi-Definite Program (SDP, 1994); the remaining region is annotated as having very few applications.]

Figure 2.1. Hierarchy of convex optimization problems.


Primal & dual problems Let us start by recalling the standard form of convex optimization:

$$\min_{x} \ f(x) : \quad f_i(x) \le 0, \ i \in \{1, \dots, m\}, \quad Ax - b = 0 \tag{2.1}$$

where $f(x)$ and $f_i(x)$'s are convex functions, $A \in \mathbb{R}^{p \times d}$ and $p < d$. Without loss of generality, assume that $\mathrm{rank}(A) = p$; otherwise, one can remove dependent rows in $A$ to make it full rank. The primal problem is defined as the problem that we start with, and hence the above is the primal problem.

There is another problem which is intimately related to the primal problem, called the dual problem. To explain what it means, we first need a function called the Lagrange function, denoted by $\mathcal{L}(x, \lambda, \nu)$. It takes three arguments. The first is the optimization variable $x$. The second argument $\lambda$ is a real-valued vector of size $m$, which coincides with the number of inequality constraints: $\lambda := [\lambda_1, \dots, \lambda_m]^T$. The last argument $\nu$ (pronounced "nu") is also a real-valued vector, yet of a different size $p$, which is the number of equality constraints: $\nu := [\nu_1, \dots, \nu_p]^T$. The Lagrange function is defined as:

$$\mathcal{L}(x, \lambda, \nu) = f(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \nu^T (Ax - b). \tag{2.2}$$

Notice in the summation term that $f_i(x)$ (which appears in the $i$th inequality constraint) is multiplied by $\lambda_i$. Similarly the $i$th equality-constraint function is multiplied by $\nu_i$ to form the last term $\nu^T (Ax - b)$. Hence, $\lambda_i$'s and $\nu_i$'s are called Lagrange multipliers.

Are we now ready to define the dual problem? Not yet. We need one more function, called the Lagrange dual function, or simply the dual function. It is denoted by $g(\lambda, \nu)$ and defined as:

$$g(\lambda, \nu) := \min_{x \in \mathcal{X}} \mathcal{L}(x, \lambda, \nu) = \min_{x \in \mathcal{X}} \ f(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \nu^T (Ax - b). \tag{2.3}$$

Two things to note. The first and very important one is that the minimization here is over the entire space that $x$ lies in w.r.t. $f(x)$ and $f_i(x)$'s: $\mathcal{X} := \mathrm{dom} f \cap \mathrm{dom} f_1 \cap \cdots \cap \mathrm{dom} f_m$. Notice that the search space is not limited to the feasible set induced by the inequality and equality constraints in the primal problem. The second thing to note is that in general $\mathcal{L}(x, \lambda, \nu)$ is not necessarily convex in $x$. When $\lambda_i \ge 0$ $\forall i$, $\mathcal{L}(x, \lambda, \nu)$ is a sum of convex and affine functions, so it is convex. However, $\lambda_i$'s could be negative, as there is no sign constraint on $\lambda$ in defining $\mathcal{L}(x, \lambda, \nu)$. In such a case, $g(\lambda, \nu)$
could be $-\infty$. For instance, think about a situation in which $\lambda_i = -1$, $f_i(x) = x^2$ (convex) and $x \in \mathbb{R}$. In this case, taking $x \to +\infty$ yields $g(\lambda, \nu) = -\infty$.

We are now ready to define the dual problem of our primary interest. Observe in (2.3) that $g(\lambda, \nu)$ is a pointwise minimum of affine functions (in $(\lambda, \nu)$) over all $x$'s in $\mathcal{X}$. Hence, it is concave in $(\lambda, \nu)$. Why? Think about what we proved in Prob 1.6(c): the maximum of convex functions is convex. In particular, the maximum of affine functions is convex, and similarly the minimum of affine functions is concave. One may still wonder about the case (2.3), in which the minimum is taken over potentially infinitely many candidates of $x$ rather than only a few. Even in this case the claim still holds, and the proof is not that difficult (think about it). Hence, maximizing $g(\lambda, \nu)$ is a concave maximization, i.e., a convex optimization problem. The dual problem is the optimization problem that seeks this maximum. It is formulated as:


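$$\text{(Dual problem):} \quad \max_{\lambda, \nu} \ g(\lambda, \nu) : \quad \lambda \ge 0. \tag{2.4}$$

Notice that there is a constraint on $\lambda$ ($\lambda \ge 0$) while there is none for $\nu$. This together with the definition (2.3) of the dual function gives the following equivalent expression:

$$\max_{\lambda, \nu} \min_{x \in \mathcal{X}} \ f(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \nu^T (Ax - b) : \quad \lambda \ge 0. \tag{2.5}$$

What does strong duality mean? Here is a summary of the primal and dual problems:

$$\text{(Primal):} \quad p^* := \min_{x} \ f(x) : \ f_i(x) \le 0, \ i \in \{1, \dots, m\}, \ Ax - b = 0;$$

$$\text{(Dual):} \quad d^* := \max_{\lambda, \nu} \min_{x \in \mathcal{X}} \ f(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \nu^T (Ax - b) : \ \lambda \ge 0.$$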


We denote by $p^*$ (or $d^*$) the optimal value of the primal (or dual) problem. Using these, we can now state what strong duality is: the optimal values of the two problems are equal:

$$\text{(Strong duality):} \quad p^* = d^*. \tag{2.6}$$

For a general optimization problem, the two optimal values can differ, i.e., strong duality does not necessarily hold. But interestingly strong duality (2.6) does hold for the convex optimization of our interest, under a very mild condition.¹ We call this the strong duality theorem.
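
To make the definitions of the dual function and strong duality concrete, here is a minimal sketch on a toy problem. The problem (minimize $x^2$ subject to $1 - x \le 0$), the function names and the grid over $\lambda$ are illustrative assumptions, not taken from the text; the dual function is evaluated through the closed-form inner minimizer $x = \lambda/2$.

```python
# Toy problem (illustrative assumption): minimize x^2 subject to 1 - x <= 0.
# Primal optimum: x* = 1, so p* = 1.

def lagrangian(x, lam):
    """L(x, lam) = f(x) + lam * f_1(x) with f(x) = x^2 and f_1(x) = 1 - x."""
    return x ** 2 + lam * (1.0 - x)

def dual_function(lam):
    """g(lam) = min_x L(x, lam); setting dL/dx = 2x - lam = 0 gives x = lam / 2."""
    return lagrangian(lam / 2.0, lam)

p_star = 1.0
lam_grid = [k / 1000.0 for k in range(5001)]   # lam in [0, 5], step 0.001
d_star, lam_star = max((dual_function(lam), lam) for lam in lam_grid)

print("p* =", p_star, " d* =", d_star, " attained at lam =", lam_star)
```

On this instance every value $g(\lambda)$ on the grid stays below $p^*$, and the largest one matches $p^*$, which is what strong duality predicts for a convex problem with a strictly feasible point.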

Now you may wonder why the strong duality theorem matters in the design of algorithms. The reason is that one can derive necessary and sufficient conditions for strong duality to hold, and these conditions provide algorithmic insights. So for the rest of this section, we will derive such conditions. In upcoming sections, we will see why the conditions shed light on how to design algorithms. We will then prove the strong duality theorem.


1. The mild condition says that there exists a point $x$ such that the strict inequalities $f_i(x) < 0$ $\forall i$ hold subject to $Ax = b$. The condition holds for almost all the problem instances that arise in practice, so one can say that strong duality usually holds for convex optimization. We will discuss this in detail later.


Necessary conditions for strong duality to hold Let us first focus on the derivation of necessary conditions. Suppose that strong duality holds, i.e., $p^* = d^*$, and that $x^*$ and $(\lambda^*, \nu^*)$ are the optimal solutions of the primal and dual problems, respectively.

Since $f(x^*) = p^* = d^* = g(\lambda^*, \nu^*)$ under the hypothesis, we get:

$$\begin{aligned} f(x^*) &= g(\lambda^*, \nu^*) \\ &\overset{(a)}{=} \min_{x \in \mathcal{X}} \ f(x) + \sum_{i=1}^{m} \lambda_i^* f_i(x) + \nu^{*T} (Ax - b) \\ &\overset{(b)}{\le} f(x^*) + \sum_{i=1}^{m} \lambda_i^* f_i(x^*) + \nu^{*T} (Ax^* - b) \\ &\overset{(c)}{\le} f(x^*) \end{aligned} \tag{2.7}$$


where (a) is due to the definitions of the dual function (2.3) and the Lagrange function (2.2); (b) comes from the fact that $x^*$ is one particular choice in the minimization problem in step (a); and (c) follows from the fact that $\lambda_i^* \ge 0$, $f_i(x^*) \le 0$ $\forall i$, and $Ax^* - b = 0$ (since $x^*$ and $\lambda^*$ must be feasible points).

In the above, the left-hand side and the right-hand side are both $f(x^*)$, so the two inequalities in steps (b) and (c) are tight. From the tightness of the inequality (b), we see that $x^*$ indeed minimizes $\mathcal{L}(x, \lambda^*, \nu^*)$ over $x$. Since this minimization is unconstrained, the optimality condition is that the gradient at $x^*$ is zero:

$$\nabla_x \mathcal{L}(x^*, \lambda^*, \nu^*) = 0. \tag{2.8}$$

On the other hand, the tightness of the second inequality (c) implies $\sum_{i=1}^{m} \lambda_i^* f_i(x^*) = 0$, which in turn yields:

$$\lambda_i^* f_i(x^*) = 0 \quad \forall i. \tag{2.9}$$

This is because $\lambda_i^* \ge 0$ and $f_i(x^*) \le 0$ $\forall i$ for feasible points $(x^*, \lambda^*)$: every term in the sum is non-positive, and a sum of non-positive terms vanishes only if each term does. This condition is called the complementary slackness condition. Why the name? The term $\lambda_i^* f_i(x^*)$ captures a sort of slack (gap) between $d^* := g(\lambda^*, \nu^*)$ and $p^* := f(x^*)$; see the 1st, 3rd and 4th lines in (2.7). The condition (2.9) implies: $\forall i$,

$$f_i(x^*) < 0 \implies \lambda_i^* = 0; \qquad \lambda_i^* > 0 \implies f_i(x^*) = 0.$$

This says that whenever one of the two inequalities ($\lambda_i^* \ge 0$ or $f_i(x^*) \le 0$) is strict, the other must be tight, i.e., the two are complementary in ensuring the equality.

The conditions (2.8) and (2.9) together with the constraints in the primal and dual problems then constitute the following necessary conditions for strong duality to hold:

$$\nabla_x \mathcal{L}(x^*, \lambda^*, \nu^*) = 0; \tag{2.10}$$
$$\lambda_i^* f_i(x^*) = 0 \quad \forall i; \tag{2.11}$$
$$f_i(x^*) \le 0 \quad \forall i; \tag{2.12}$$
$$Ax^* - b = 0; \tag{2.13}$$
$$\lambda^* \ge 0 \tag{2.14}$$

where (2.12) and (2.13) come from the primal problem and (2.14) is from the dual problem.

In fact, these conditions (2.10)–(2.14) coincide with the ones that we mentioned in Section 1.11 while deriving the closed-form solution for the equality-constrained least squares problem. These are the KKT conditions! Remember that the KKT conditions are not limited to convex optimization: they are stated for general optimization problems, convex or non-convex, and are necessary conditions for a solution to be optimal for a general optimization problem.
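
As a quick sanity check of these conditions, here is a small worked example. The toy problem (minimize $x^2$ subject to $1 - x \le 0$, with no equality constraint) is an assumption made for illustration, chosen only because every condition can be solved by hand.

```latex
% Toy problem (illustrative assumption): one inequality constraint, no equality constraint.
\min_{x \in \mathbb{R}} \ x^2 : \quad f_1(x) = 1 - x \le 0

% Lagrange function
\mathcal{L}(x, \lambda) = x^2 + \lambda (1 - x)

% KKT conditions: stationarity, complementary slackness, primal and dual feasibility
2x^* - \lambda^* = 0, \qquad \lambda^* (1 - x^*) = 0, \qquad 1 - x^* \le 0, \qquad \lambda^* \ge 0

% Case \lambda^* = 0: stationarity gives x^* = 0, which violates 1 - x^* \le 0.
% Case 1 - x^* = 0: x^* = 1, and stationarity gives \lambda^* = 2 \ge 0, so all conditions hold.
x^* = 1, \qquad \lambda^* = 2, \qquad f(x^*) = 1
```

The constraint is active at the solution, so complementary slackness permits a strictly positive multiplier.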


KKT conditions are also sufficient for strong duality to hold Interestingly, in convex optimization the KKT conditions are also sufficient for strong duality to hold:

$$\text{KKT conditions} \implies p^* = f(x^*) = g(\lambda^*, \nu^*) = d^* \tag{2.15}$$

where $(x^*, \lambda^*, \nu^*)$ are points that respect the KKT conditions.

We will prove the following two: $p^* \le d^*$ and $p^* \ge d^*$. Focus on the former. To this end, we will first show that $f(x^*) = g(\lambda^*, \nu^*)$. Recall the definition of the Lagrange function (2.2) to obtain:

$$\mathcal{L}(x^*, \lambda^*, \nu^*) = f(x^*) + \sum_{i=1}^{m} \lambda_i^* f_i(x^*) + \nu^{*T} (Ax^* - b) = f(x^*) \tag{2.16}$$

where the second equality follows from (2.11) and (2.13). On the other hand, from the definition of the dual function (2.3), we get:

$$\begin{aligned} g(\lambda^*, \nu^*) &:= \min_{x \in \mathcal{X}} \mathcal{L}(x, \lambda^*, \nu^*) \\ &= \min_{x \in \mathcal{X}} \ f(x) + \sum_{i=1}^{m} \lambda_i^* f_i(x) + \nu^{*T} (Ax - b) \\ &\overset{(a)}{=} f(x^*) + \sum_{i=1}^{m} \lambda_i^* f_i(x^*) + \nu^{*T} (Ax^* - b) \\ &= \mathcal{L}(x^*, \lambda^*, \nu^*) \end{aligned} \tag{2.17}$$

where (a) comes from the fact that $\mathcal{L}(x, \lambda^*, \nu^*)$ is convex in $x$ (as $\lambda^* \ge 0$ by (2.14)), so the zero-gradient condition (2.10) of this unconstrained convex minimization implies that $x^*$ is the minimizer of $\mathcal{L}(x, \lambda^*, \nu^*)$. This together with (2.16) gives:

$$f(x^*) = g(\lambda^*, \nu^*). \tag{2.18}$$

Since $x^*$ is feasible by (2.12) and (2.13) and $p^*$ is the optimal value of the primal minimization problem, $p^* \le f(x^*)$. Also $d^* \ge g(\lambda^*, \nu^*)$, as $d^*$ is the optimal value of the dual maximization problem. These together with (2.18) yield:

$$p^* \le d^*. \tag{2.19}$$

Now we prove that

$$p^* \ge d^* \tag{2.20}$$


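to complete the proof of sufficiency of the KKT conditions for ensuring strong duality. To prove (2.20), consider a primal optimal point, say $\tilde{x}$, that achieves $p^*$. Also consider a dual optimal point, say $(\tilde{\lambda}, \tilde{\nu})$, that achieves $d^*$. Then,

$$\begin{aligned} p^* &= f(\tilde{x}) \\ &\overset{(a)}{\ge} f(\tilde{x}) + \sum_{i=1}^{m} \tilde{\lambda}_i f_i(\tilde{x}) + \tilde{\nu}^T (A\tilde{x} - b) \\ &\ge \min_{x \in \mathcal{X}} \ f(x) + \sum_{i=1}^{m} \tilde{\lambda}_i f_i(x) + \tilde{\nu}^T (Ax - b) \\ &\overset{(b)}{=} g(\tilde{\lambda}, \tilde{\nu}) \\ &= d^* \end{aligned} \tag{2.21}$$

where (a) follows from the fact that $\tilde{\lambda}_i \ge 0$, $f_i(\tilde{x}) \le 0$ $\forall i$ and $A\tilde{x} - b = 0$ (since $\tilde{x}$ and $(\tilde{\lambda}, \tilde{\nu})$ must be feasible points); the second inequality holds because $\tilde{x}$ is one particular choice in the minimization; and (b) is due to the definition (2.3) of the dual function.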

The relationship between $p^*$ and $d^*$ stated in (2.20) is called weak duality. It turns out that weak duality holds for any optimization problem (including non-convex optimization); this will be explored further in a later section.


Look ahead What can we do with the KKT conditions in the design of algorithms? In the next section, we will study this in detail and demonstrate that the conditions indeed play a crucial role in gaining algorithmic insights.

2.2 Interior Point Method


Recap In the previous section, we embarked on Part II and started investigating strong duality and the KKT conditions, which, as we claimed several times, provide a detailed guideline on how to design algorithms. Strong duality relies on the concept of primal and dual problems:

$$\text{(Primal):} \quad p^* := \min_{x} \ f(x) : \ f_i(x) \le 0, \ i \in \{1, \dots, m\}, \ Ax - b = 0;$$

$$\text{(Dual):} \quad d^* := \max_{\lambda, \nu} \min_{x \in \mathcal{X}} \ f(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \nu^T (Ax - b) : \ \lambda \ge 0$$

where $\mathcal{X} := \mathrm{dom} f \cap \mathrm{dom} f_1 \cap \cdots \cap \mathrm{dom} f_m$. Using these, we figured out what strong duality means:

$$\text{(Strong duality):} \quad p^* = d^*. \tag{2.22}$$

We then stated (yet without proof) that strong duality (2.22) holds for convex optimization under a mild condition. Next we derived the necessary and sufficient conditions (KKT conditions) for strong duality to hold at a feasible point $(x^*, \lambda^*, \nu^*)$:

$$\nabla_x \mathcal{L}(x^*, \lambda^*, \nu^*) = 0; \tag{2.23}$$
$$\lambda_i^* f_i(x^*) = 0 \quad \forall i; \tag{2.24}$$
$$f_i(x^*) \le 0 \quad \forall i; \tag{2.25}$$
$$Ax^* - b = 0; \tag{2.26}$$
$$\lambda^* \ge 0. \tag{2.27}$$

Lastly we claimed that the KKT conditions give algorithmic insights.


Outline In this section, we are going to study in detail why that is the case. We will support the claim in the context of the following three problem settings. The first is a somewhat special yet prominent setting where we already saw the KKT conditions (in Section 1.11): the equality-constrained least squares problem. In this case, we will demonstrate that the KKT conditions indeed lead to the closed-form solution that we saw. The second is a broader setting which still has only equality constraints. In this setting, we will show that the KKT conditions can be solved via the gradient descent that we studied in Section 1.3. The last is a general setting which contains inequality constraints as well. We will introduce one very powerful algorithm, called the interior point method (Wright, 2005), which approximately solves the KKT conditions and therefore approaches the optimal value with a reasonably small gap.


Equality-constrained least squares Recall the equality-constrained least squares problem:

$$\min_{x} \ \|Ax - b\|^2 : \quad Cx - e = 0.$$

Let us verify that the KKT conditions (2.23)–(2.27) lead to the closed-form solution $x^*$ that respects

$$\begin{bmatrix} 2A^T A & C^T \\ C & 0 \end{bmatrix} \begin{bmatrix} x^* \\ z \end{bmatrix} = \begin{bmatrix} 2A^T b \\ e \end{bmatrix} \tag{2.28}$$

for some $z$. Let $\mathcal{L}(x, \nu)$ be the Lagrange function:

$$\mathcal{L}(x, \nu) = \|Ax - b\|^2 + \nu^T (Cx - e).$$

We first simplify the KKT conditions, tailoring them for this equality-constrained setting:

$$\nabla_x \mathcal{L}(x^*, \nu^*) = 0; \tag{2.29}$$
$$Cx^* - e = 0. \tag{2.30}$$

Taking a derivative of the Lagrange function w.r.t. $x$ and setting it to 0, (2.29) reads:

$$\nabla_x \mathcal{L}(x^*, \nu^*) = 2A^T A x^* - 2A^T b + C^T \nu^* = 0. \tag{2.31}$$

This together with the equality constraint yields:

$$2A^T A x^* - 2A^T b + C^T \nu^* = 0; \tag{2.32}$$
$$Cx^* - e = 0. \tag{2.33}$$

These are exactly the two block rows of (2.28). Compared to the setting investigated in Section 1.11, the only distinction here is that we used a different notation $\nu^*$ instead of $z$.
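
The block system formed by (2.32) and (2.33) can be solved numerically. Below is a minimal sketch in plain Python; the data `A`, `b`, `C`, `e` and the helper `solve_linear` (a small Gaussian elimination written here only to keep the example self-contained) are illustrative assumptions, and in practice one would call a linear-algebra library instead.

```python
def solve_linear(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting (hypothetical helper)."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in reversed(range(n)):                      # back substitution
        s = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - s) / aug[r][r]
    return sol

# Toy data (assumption): minimize ||x - b||^2 subject to x_1 + x_2 = 1.
A = [[1.0, 0.0], [0.0, 1.0]]
b = [1.0, 2.0]
C = [[1.0, 1.0]]
e = [1.0]
d, p, m = 2, 1, len(A)

AtA = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(d)] for i in range(d)]
Atb = [sum(A[k][i] * b[k] for k in range(m)) for i in range(d)]

# KKT matrix [[2 A^T A, C^T], [C, 0]] and right-hand side [2 A^T b; e].
kkt = [[2.0 * AtA[i][j] for j in range(d)] + [C[r][i] for r in range(p)] for i in range(d)]
kkt += [[C[r][j] for j in range(d)] + [0.0] * p for r in range(p)]
rhs = [2.0 * v for v in Atb] + e

solution = solve_linear(kkt, rhs)
x_star, nu_star = solution[:d], solution[d:]
print("x* =", x_star, " nu* =", nu_star)
```

For this instance the constraint line $x_1 + x_2 = 1$ is closest to $b = (1, 2)$ at $(0, 1)$, which gives a hand check on the solver output.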


Equality-constrained convex optimization What about general convex optimization problems? It turns out that one can develop algorithms by solving the KKT conditions. In particular, for equality-constrained optimization problems, one can come up with a simple algorithm.

So let us first consider the equality-constrained setting:

$$\min_{x} \ f(x) : \quad Ax - b = 0. \tag{2.34}$$

The Lagrange function is defined as: $\mathcal{L}(x, \nu) = f(x) + \nu^T (Ax - b)$. Under this setting, the KKT conditions (2.23)–(2.27) read:

$$\nabla_x \mathcal{L}(x^*, \nu^*) = 0; \tag{2.35}$$
$$Ax^* - b = 0. \tag{2.36}$$

Here the second condition can be rewritten as:

$$Ax^* - b = 0 \iff \nabla_\nu \mathcal{L}(x^*, \nu^*) = 0. \tag{2.37}$$

Since strong duality holds ($p^* = d^*$) in the convex optimization setting (due to the strong duality theorem), it suffices to develop an algorithm that achieves the optimal value of the dual problem. So we focus on:

$$d^* := \max_{\nu} \ g(\nu) = \max_{\nu} \min_{x \in \mathcal{X}} \mathcal{L}(x, \nu).$$

Here one can make the following observations: (i) $\mathcal{L}(x, \nu)$ is convex in $x$; (ii) $\min_{x \in \mathcal{X}} \mathcal{L}(x, \nu)$ is concave in $\nu$; (iii) the inner minimization $\min_{x \in \mathcal{X}} \mathcal{L}(x, \nu)$ is unconstrained (w.r.t. $x$); and (iv) the outer maximization $\max_{\nu} \min_{x \in \mathcal{X}} \mathcal{L}(x, \nu)$ is unconstrained (w.r.t. $\nu$). Remember from Section 1.3 that the optimality condition for unconstrained convex minimization (or concave maximization) is that the gradient evaluated at the optimal point must be 0. More specifically, the optimality condition for the inner minimization problem is: given a $\nu$,

$$\nabla_x \mathcal{L}(x^*(\nu), \nu) = 0 \tag{2.38}$$

where $x^*(\nu) := \arg\min_{x \in \mathcal{X}} \mathcal{L}(x, \nu)$. The optimality condition for the outer maximization problem is:

$$\nabla_\nu \mathcal{L}(x^*(\nu^*), \nu^*) = 0 \tag{2.39}$$

where $\mathcal{L}(x^*(\nu), \nu) = \min_{x \in \mathcal{X}} \mathcal{L}(x, \nu)$. Letting $x^* := x^*(\nu^*)$, the two conditions (2.38), (2.39) yield:

$$\nabla_x \mathcal{L}(x^*, \nu^*) = 0; \qquad \nabla_\nu \mathcal{L}(x^*, \nu^*) = 0.$$

This then naturally motivates us to find a point $(x^*, \nu^*)$ at which the two gradients are zero.

Figure 2.2. Alternating gradient descent for equality-constrained optimization.


Gradient descent In Section 1.3, we studied one popular algorithm for finding a stationary point, where the gradient is 0: gradient descent. So we can use the same algorithm. The only distinction here is that we have two sets of variables $(x, \nu)$ to optimize over, and so we have two corresponding gradients to compute. Below is how the modified algorithm works.

Let $(x^{(t)}, \nu^{(t)})$ be the estimates at the $t$th iteration; see Fig. 2.2. First we compute the gradient w.r.t. $x$ at the point: $\nabla_x \mathcal{L}(x^{(t)}, \nu^{(t)})$. Since $\mathcal{L}(x, \nu)$ is convex in $x$ (see Fig. 2.2 as well), we should move the point in the direction opposite to the gradient so as to approach the optimal solution. So we update $x^{(t)}$ as:

$$x^{(t+1)} \leftarrow x^{(t)} - \alpha^{(t)} \nabla_x \mathcal{L}(x^{(t)}, \nu^{(t)})$$


where $\alpha^{(t)} > 0$ indicates the learning rate, which is usually set as a decaying function like $\alpha^{(t)} = \frac{1}{t}$.

Next, we compute the gradient w.r.t. $\nu$: $\nabla_\nu \mathcal{L}(x^{(t)}, \nu^{(t)})$. Notice that $\min_x \mathcal{L}(x, \nu)$ $(=: \mathcal{L}(x^*(\nu), \nu))$ is concave in $\nu$; see the bottom part of the curve in Fig. 2.2 where the minimum of $\mathcal{L}(x, \nu)$ is attained over $x$. So we should move the point in the same direction as the gradient so as to approach the optimal solution. So we update $\nu^{(t)}$ as:

$$\nu^{(t+1)} \leftarrow \nu^{(t)} + \beta^{(t)} \nabla_\nu \mathcal{L}(x^{(t)}, \nu^{(t)})$$

where $\beta^{(t)} > 0$ indicates another learning rate, which is not necessarily the same as $\alpha^{(t)}$. Precisely speaking, this update is gradient ascent rather than gradient descent, although many people call it gradient descent nonetheless. So the entire procedure above is often called alternating gradient descent.

We repeat the above procedure until $(x^{(t)}, \nu^{(t)})$ converges. It turns out that as $t \to \infty$, it actually converges:

$$(x^{(t)}, \nu^{(t)}) \longrightarrow (x^*, \nu^*), \tag{2.40}$$

as long as the learning rates $(\alpha^{(t)}, \beta^{(t)})$ are properly chosen (like decaying functions). As in Section 1.3, we will not touch upon the convergence proof.
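
The two updates above can be sketched in plain Python on a toy instance. The instance ($f(x) = \|x - c\|^2$ with a single constraint $a^T x = b$), the constant learning rates and the iteration count are assumptions made for illustration (the text suggests decaying rates); the gradients are those of $\mathcal{L}(x, \nu) = f(x) + \nu (a^T x - b)$.

```python
def grad_x(x, nu, c, a):
    """Gradient of L(x, nu) = ||x - c||^2 + nu * (a^T x - b) with respect to x."""
    return [2.0 * (xi - ci) + nu * ai for xi, ci, ai in zip(x, c, a)]

def grad_nu(x, a, b):
    """Gradient of L with respect to nu, i.e. the constraint residual a^T x - b."""
    return sum(ai * xi for ai, xi in zip(a, x)) - b

# Toy data (assumption): project c = (1, 2) onto the line x_1 + x_2 = 1.
c, a, b = [1.0, 2.0], [1.0, 1.0], 1.0
x, nu = [0.0, 0.0], 0.0
alpha = beta = 0.1            # constant learning rates, chosen for this toy problem

for t in range(2000):
    gx = grad_x(x, nu, c, a)  # both gradients are evaluated at (x^(t), nu^(t))
    gnu = grad_nu(x, a, b)
    x = [xi - alpha * g for xi, g in zip(x, gx)]   # descent step in x
    nu = nu + beta * gnu                           # ascent step in nu

print("x =", x, " nu =", nu)
```

The descent step in $x$ and the ascent step in $\nu$ use opposite signs, mirroring the convex-in-$x$, concave-in-$\nu$ structure. The limit point should satisfy both zero-gradient conditions, which can be checked by evaluating `grad_x` and `grad_nu` at the returned values.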


Interior point method (Wright, 2005) What about general convex optimization settings which also involve inequality constraints? This is a more challenging case: it is not that simple to solve the KKT conditions (2.23)–(2.27) directly. Instead there are algorithms which solve the KKT conditions approximately. One very popular such algorithm is the interior point method.

The idea of the method is to take the following two steps:


1. Approximate the primal problem by an equality-constrained optimization.
2. Apply equality-constraint-tailored algorithms (like the alternating gradient descent explained earlier) to the approximated optimization.


Since the method is based on an approximation trick, one may wonder how far the performance of such an approach is from optimality. It turns out that with a proper approximation trick (that we will investigate soon), we can obtain a solution with a small gap to optimality. To see this, let us first investigate what the approximation trick is.


Approximation trick Recall the standard form of general convex optimization including inequality constraints:

$$\min_{x} \ f(x) : \quad f_i(x) \le 0, \ i \in \{1, \dots, m\}, \quad Ax - b = 0 \tag{2.41}$$

where $f(x)$ and $f_i(x)$'s are convex functions, $A \in \mathbb{R}^{p \times d}$ and $p < d$.

How do we handle the inequality constraints? What we wish to do is to merge them with the objective function $f(x)$ so that we have only equality constraints. To this end, we set up a specific goal:

$$f_i(x) \le 0 \ \longrightarrow \ \text{the merged objective function} = f(x);$$
$$f_i(x) > 0 \ \longrightarrow \ \text{the reformulated optimization is infeasible.}$$

In an effort to implement the goal, we introduce a function, called the barrier function, defined as:

$$I_-(z) = \begin{cases} 0, & z \le 0; \\ \infty, & z > 0. \end{cases} \tag{2.42}$$

Now inputting $f_i(x)$ to the barrier function as an argument, we get:

$$I_-(f_i(x)) = \begin{cases} 0, & f_i(x) \le 0; \\ \infty, & f_i(x) > 0. \end{cases}$$

This then motivates the following natural idea: adding $I_-(f_i(x))$ to the objective function $f(x)$. This leads to the following reformulated problem:

$$\min_{x} \ f(x) + \sum_{i=1}^{m} I_-(f_i(x)) : \quad Ax - b = 0. \tag{2.43}$$

Notice that when $f_i(x) \le 0$ $\forall i$, the objective function is unchanged; on the other hand, whenever $f_i(x) > 0$ for some $i$, the objective takes infinity, making the problem infeasible.


A surrogate of the barrier function In the reformulated problem (2.43), however, one critical issue arises: $I_-(\cdot)$ is not differentiable. Notice that the first KKT condition (2.35) includes a gradient term, so it requires differentiability of the barrier functions as they appear in the Lagrange function. Since $I_-$ is not differentiable, we cannot implement the KKT conditions.

To resolve this issue, we consider a surrogate of the barrier function which is differentiable and approximates the barrier function well. One very well-known surrogate is the logarithmic barrier:

$$LB(z) := -\mu \log(-z), \quad \mu > 0. \tag{2.44}$$

The function is indeed differentiable (for $z < 0$), and it approximates the barrier function well for a small $\mu$. See Fig. 2.3. Moreover, it is convex and non-decreasing in $z$, so composing it with the convex $f_i(x)$ keeps the objective function convex.
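
The following short sketch compares the barrier function with its logarithmic surrogate at a few points. The function names, the sample points and the convention of returning infinity outside the domain $z < 0$ are assumptions made for illustration.

```python
import math

def indicator_barrier(z):
    """Barrier function I_-(z): 0 when z <= 0, infinity when z > 0."""
    return 0.0 if z <= 0 else math.inf

def log_barrier(z, mu):
    """Logarithmic barrier -mu * log(-z); treated as +infinity outside z < 0."""
    return -mu * math.log(-z) if z < 0 else math.inf

# Evaluate the surrogate at strictly feasible points for decreasing mu.
for mu in (1.0, 0.1, 0.01):
    values = [round(log_barrier(z, mu), 4) for z in (-2.0, -0.5, -0.01)]
    print("mu =", mu, " LB at z = -2, -0.5, -0.01:", values)
```

At any fixed $z < 0$ the surrogate scales linearly with $\mu$, so it approaches the barrier value 0 as $\mu$ shrinks, while it still grows without bound as $z$ approaches 0 from below.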


Approximated convex optimization Replacing the barrier function with the logarithmic barrier in (2.43), we can approximate (2.43) as:

$$\min_{x} \ f(x) - \mu \sum_{i=1}^{m} \log(-f_i(x)) : \quad Ax - b = 0. \tag{2.45}$$

There is one caveat here: the search space of $x$ should be:

$$\{x : f_1(x) < 0, \dots, f_m(x) < 0, \ Ax - b = 0\}. \tag{2.46}$$

This is because the equality $f_i(x) = 0$ for some $i$ makes the logarithmic barrier function blow up. So we assume that the set in (2.46) is not empty. Actually this suggested the naming of "interior point method", since the method searches over interior points.

Figure 2.3. Logarithmic barrier functions for different control parameters $\mu$.

Since the approximated optimization (2.45) contains only the equality constraint, we can apply exactly the same approach that we took for the earlier equality-constrained setting. In other words, we first compute the Lagrange function:

$$\mathcal{L}(x, \nu) = f(x) - \mu \sum_{i=1}^{m} \log(-f_i(x)) + \nu^T (Ax - b). \tag{2.47}$$

We then try to find a stationary point $(x^*, \nu^*)$ that satisfies the KKT conditions:

$$\nabla_x \mathcal{L}(x^*, \nu^*) = 0; \qquad \nabla_\nu \mathcal{L}(x^*, \nu^*) = 0. \tag{2.48}$$

Again one can use alternating gradient descent to solve this.


Performance gap to the optimality Once we employ the interior point method that is based on the approximated optimization (2.45), one natural question arises: how far is the performance of the approximation approach from optimality?

To figure this out, we first consider the stationary point $(x^*, \nu^*)$ that respects the KKT conditions (2.48). At this point, we obtain $f(x^*)$, and $f(x^*) \ge p^*$, since $x^*$ (which satisfies (2.48) intended for the approximated optimization) is feasible for the original problem but not necessarily its optimal solution. So the performance gap can be quantified as $f(x^*) - p^*$. The gap obviously depends on $\mu$, the control parameter adjusting the closeness to the barrier function. Let us figure out how it varies with $\mu$. Remember that the smaller $\mu$, the more precise the approximation is. Hence, one can expect that the smaller $\mu$, the smaller the gap. It turns out that this is the case:

$$f(x^*) - p^* \le m\mu. \tag{2.49}$$

Observe that the gap is at most $m\mu$, so we approach optimality for a small value of $\mu$.

A special note on the choice of $\mu$: One may want to set $\mu$ arbitrarily small to ensure almost optimal performance. In practice, however, this is not recommended. The reason is that the KKT conditions (2.48) are solved via an algorithm (like alternating gradient descent) whose convergence speed is significantly affected by $\mu$: the smaller $\mu$, the slower the convergence. Hence, in practice, the choice should be made carefully, taking this tradeoff into consideration.
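
To see the bound $f(x^*) - p^* \le m\mu$ and the tradeoff in $\mu$ on a concrete instance, here is a hedged sketch. The toy problem (a linear objective $c^T x$ with lower-bound constraints $l_i - x_i \le 0$ and no equality constraint, so only $x$ is updated), the function name `barrier_minimizer` and the step-size rule are all assumptions, chosen so that plain gradient descent stays strictly inside the feasible set; this is not a general-purpose interior point solver.

```python
def barrier_minimizer(c, lower, mu, iters=5000):
    """Gradient descent on  sum_i c_i * x_i - mu * sum_i log(x_i - lower_i)."""
    x = [l + 1.0 for l in lower]            # strictly feasible starting point
    alpha = 0.5 * mu / max(c) ** 2          # toy-specific step size: shrinks with mu
    for _ in range(iters):
        grad = [ci - mu / (xi - li) for ci, xi, li in zip(c, x, lower)]
        x = [xi - alpha * g for xi, g in zip(x, grad)]
    return x

# Toy problem (assumption): minimize x_1 + 2 x_2 subject to x_1 >= 1, x_2 >= 0.
c, lower = [1.0, 2.0], [1.0, 0.0]
m = len(c)                                              # number of inequality constraints
p_star = sum(ci * li for ci, li in zip(c, lower))       # optimum sits at the lower bounds

gaps = {}
for mu in (1.0, 0.1, 0.01):
    x = barrier_minimizer(c, lower, mu)
    gaps[mu] = sum(ci * xi for ci, xi in zip(c, x)) - p_star
    print("mu =", mu, " gap f(x*) - p* =", gaps[mu], " bound m*mu =", m * mu)
```

In this sketch the step size has to shrink together with $\mu$ to keep the iterates in the interior, so a smaller $\mu$ buys a smaller gap at the price of slower progress, which is the tradeoff described above.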


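Proof of $f(x^*) - p^* \le m\mu$ Here $x^*$ together with $\nu^*$ denotes the stationary point of the approximated optimization (not necessarily the optimal solution to the original optimization):

$$\nabla_x \mathcal{L}_{\mathrm{app}}(x^*, \nu^*) = \nabla_x \Big( f(x) - \mu \sum_{i=1}^{m} \log(-f_i(x)) + \nu^T (Ax - b) \Big) \Big|_{(x, \nu) = (x^*, \nu^*)} = 0 \tag{2.50}$$

where $\mathcal{L}_{\mathrm{app}}(x, \nu)$ denotes the Lagrange function of the approximated optimization. Starting with strong duality, we get:

$$\begin{aligned} p^* &= d^* \\ &= \max_{\lambda \ge 0, \nu} \ g(\lambda, \nu) \\ &\overset{(a)}{\ge} g(\lambda^*, \nu^*) \\ &\overset{(b)}{=} \min_{x \in \mathcal{X}} \ f(x) + \sum_{i=1}^{m} \lambda_i^* f_i(x) + \nu^{*T} (Ax - b) \end{aligned} \tag{2.51}$$

where (a) follows from the fact that $\nu^*$ is a feasible point that satisfies (2.50) (not necessarily the one that maximizes $g(\lambda, \nu)$), and $\lambda^*$ is another particular feasible point which will be detailed soon; and (b) is due to the definition of the dual function.

Remember that $(x^*, \nu^*)$ is a point that satisfies (2.50), and hence:

$$\nabla_x \mathcal{L}_{\mathrm{app}}(x^*, \nu^*) = \nabla f(x^*) + \sum_{i=1}^{m} \frac{-\mu}{f_i(x^*)} \nabla f_i(x^*) + A^T \nu^* = 0. \tag{2.52}$$

From this, we see that the particular point $\lambda^*$ can be chosen as:

$$\lambda_i^* = \frac{-\mu}{f_i(x^*)}. \tag{2.53}$$

This choice is dual feasible: $f_i(x^*) < 0$ at an interior point, so $\lambda_i^* > 0$. Under this particular choice (2.53), the condition (2.52) implies that $x^*$ is a minimizer of the optimization in step (b) in (2.51) under $(-\mu / f(x^*), \nu^*)$:

$$\min_{x \in \mathcal{X}} \mathcal{L}(x, -\mu / f(x^*), \nu^*) = \mathcal{L}(x^*, -\mu / f(x^*), \nu^*) \tag{2.54}$$

where $\mathcal{L}(\cdot, \cdot, \cdot)$ indicates the Lagrange function of the original optimization, and $-\mu / f(x^*)$ is shorthand for the vector with entries $-\mu / f_i(x^*)$. Hence, applying this to (2.51), we get:

$$\begin{aligned} p^* &\ge \min_{x \in \mathcal{X}} \ f(x) + \sum_{i=1}^{m} \frac{-\mu}{f_i(x^*)} f_i(x) + \nu^{*T} (Ax - b) \\ &= f(x^*) + \sum_{i=1}^{m} \frac{-\mu}{f_i(x^*)} f_i(x^*) + \nu^{*T} (Ax^* - b) \\ &\overset{(c)}{=} f(x^*) - m\mu \end{aligned} \tag{2.55}$$

where (c) comes from $Ax^* - b = 0$ (since $x^*$ must be a feasible point). This then yields the upper bound of the gap to complete the proof:

$$f(x^*) - p^* \le m\mu.$$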


Look ahead So far we have studied what strong duality is, and derived the KKT conditions, which are necessary and sufficient conditions for strong duality to hold in convex optimization. We also demonstrated that the KKT conditions provide detailed guidelines on how to design algorithms. In the next section, we will prove the strong duality theorem, which we have so far only stated without proof.


Problem Set 5


Prob 5.1 (Lagrange function & dual function) Consider a general optimization problem (not necessarily convex):

$$\min_{x \in \mathbb{R}^d} \ f(x) : \quad f_i(x) \le 0, \ i \in \{1, \dots, m\}, \quad Ax - b = 0 \tag{2.56}$$

where $A \in \mathbb{R}^{p \times d}$, $b \in \mathbb{R}^p$ and $p < d$. Assume that $f : \mathbb{R}^d \to \mathbb{R}$ and $f_i : \mathbb{R}^d \to \mathbb{R}$ are differentiable.

(a) State the Lagrange function $\mathcal{L}(x, \lambda, \nu)$, the dual function $g(\lambda, \nu)$, and the dual problem.

(b) Prove that $g(\lambda, \nu)$ is concave in $(\lambda, \nu)$.
Note: Please do not use any properties that we learned in Sections and/or Problem Sets (without derivation). Use only the definition of concavity.

(c) Let $p^*$ (or $d^*$) be the optimal value of the primal (or dual) problem. Show that

$$p^* \ge d^*. \tag{2.57}$$


Prob 5.2 (KKT conditions for Quadratic Program) Consider an equality-constrained least-squares problem:

$$\min_{x} \ \|Ax - b\|^2 : \quad Cx - e = 0 \tag{2.58}$$

where $A \in \mathbb{R}^{m \times d}$ and $C \in \mathbb{R}^{p \times d}$. Assume that $m > d$ and $p < d$.

(a) State the KKT matrix. Also prove that the following condition is necessary and sufficient for the KKT matrix to be invertible:

$$\mathrm{rank}(C) = p, \qquad \mathrm{rank}\begin{bmatrix} A \\ C \end{bmatrix} = d. \tag{2.59}$$

(b) State the KKT conditions. Suppose that the KKT matrix is invertible and there exist $x^* \in \mathbb{R}^d$ and $z \in \mathbb{R}^p$ such that the KKT conditions are satisfied. Show that:

$$\|Ax - b\|^2 \ge \|Ax^* - b\|^2 \quad \forall x : Cx - e = 0. \tag{2.60}$$


Prob 5.3 (Performance of the interior point method) Consider a convex optimization problem:

$$p^* = \max_{\lambda, \nu} \ g(\lambda, \nu) : \quad \lambda \ge 0 \tag{2.61}$$

where $\lambda \in \mathbb{R}^m$ and $\nu \in \mathbb{R}^p$ are optimization variables, and $g(\lambda, \nu)$ is a concave function. Using the logarithmic barrier function:

$$LB(z) := -\mu \log(-z) \quad \text{for } \mu > 0,$$

we formulate an approximated optimization problem:

$$\max_{\lambda, \nu} \ g(\lambda, \nu) + \mu \sum_{i=1}^{m} \log \lambda_i. \tag{2.62}$$

(a) Suppose $(\tilde{\lambda}, \tilde{\nu})$ is a feasible point that satisfies the KKT conditions w.r.t. the approximated problem (2.62). State the KKT conditions.

(b) State the Lagrange function $\mathcal{L}_{\mathrm{dual}}(\lambda, \nu, \lambda_{\mathrm{dual}})$ of the primal problem (2.61). Also state the dual function $g_{\mathrm{dual}}(\lambda_{\mathrm{dual}})$ of $\mathcal{L}_{\mathrm{dual}}(\lambda, \nu, \lambda_{\mathrm{dual}})$. Here $\lambda_{\mathrm{dual}}$ denotes a Lagrange multiplier w.r.t. (2.61).

(c) Suppose $(\lambda^*, \nu^*)$ is a feasible point that satisfies the KKT conditions w.r.t. the primal problem (2.61). State the KKT conditions.

(d) Show that:

$$p^* - g(\tilde{\lambda}, \tilde{\nu}) \le m\mu. \tag{2.63}$$


Prob 5.4 (True or False?)

(a) Consider primal and dual problems with the optimal values $p^*$ and $d^*$,
respectively. Then, the KKT conditions are necessary and sufficient conditions
for $p^* = d^*$ to hold.

(b) Consider a convex optimization problem:

$$p^* = \max_{\lambda, \nu} g(\lambda, \nu) : \ \lambda \ge 0 \tag{2.64}$$

where $\lambda \in \mathbb{R}^m$ and $\nu \in \mathbb{R}^p$ are optimization variables, and $g(\lambda, \nu)$ is a
concave function. Consider another optimization problem:

$$\max_{\lambda, \nu} g(\lambda, \nu) + \mu \sum_{i=1}^{m} \log \lambda_i \tag{2.65}$$

where $\mu > 0$. Let $\bar{\lambda}$ be a feasible point, i.e., $\bar{\lambda} \ge 0$. Then,

$$p^* - g(\bar{\lambda}, \bar{\nu}) \le m\mu \tag{2.66}$$

as long as $\bar{\nu}$ is a feasible point that satisfies the KKT conditions w.r.t. the
approximated optimization (2.65).

(c) Consider a convex optimization problem:

$$\max_{x \in \mathbb{R}^d} f(x) : \ Ax - b = 0 \tag{2.67}$$

where $A \in \mathbb{R}^{p \times d}$ and $b \in \mathbb{R}^p$. Let $L(x, \nu)$ be the Lagrange function where
$\nu \in \mathbb{R}^p$ denotes the Lagrange multiplier. Suppose $x$ is a stationary point,
i.e., $\nabla_x L(x, \nu) = 0$. Then, $L(x, \nu)$ is always concave in $\nu$.

(d) Consider a convex optimization problem:

$$p^* := \min_{x \in \mathbb{R}^d} f(x) : \quad f_i(x) \le 0, \ i \in \{1, \dots, m\}, \quad h_i(x) = 0, \ i \in \{1, \dots, p\}. \tag{2.68}$$

Let $x^*$ be a feasible point that achieves the optimal value $p^*$. Suppose that
there exists a feasible point $(\bar{\lambda}, \bar{\nu})$ w.r.t. the dual problem such that the KKT
conditions are satisfied:

$$\begin{aligned}
\nabla_x L(x^*, \bar{\lambda}, \bar{\nu}) &= 0; \\
\bar{\lambda}_i f_i(x^*) &= 0 \quad \forall i; \\
f_i(x^*) &\le 0 \quad \forall i; \\
h_i(x^*) &= 0 \quad \forall i; \\
\bar{\lambda}_i &\ge 0 \quad \forall i
\end{aligned} \tag{2.69}$$

where $L(x, \lambda, \nu)$ indicates the Lagrange function. Then, the feasible point
$(\bar{\lambda}, \bar{\nu})$ also achieves the optimal value $d^*$ of the dual problem.


2.3 Proof of Strong Duality Theorem (1/2)


Recap Over the past sections, we have investigated strong duality. To
understand what it means, we studied the concept of primal and dual problems:

$$\text{(Primal):} \quad p^* := \min_{x} f(x) : \ f_i(x) \le 0, \ i \in \{1, \dots, m\}, \ Ax - b = 0;$$

$$\text{(Dual):} \quad d^* := \max_{\lambda, \nu} g(\lambda, \nu) : \ \lambda \ge 0.$$

Using these, we stated:

$$\text{(Strong duality):} \quad p^* = d^*. \tag{2.70}$$

We then argued that strong duality (2.70) holds for convex optimization under a
mild condition. The mild condition was:

$$\exists x : \ f_1(x) < 0, \dots, f_m(x) < 0, \ Ax - b = 0. \tag{2.71}$$

Next we derived necessary and sufficient conditions (the KKT conditions) for
strong duality to hold. In the last section, we found that the KKT conditions also
give algorithmic insights.


Outline In this section, we turn to the strong duality theorem whose proof we
deferred. The proof is not easy. Not only does it take many non-trivial steps and
a number of ideas, but it also relies on an important theorem that we have not
yet examined. So we will prove it step by step so that you can follow how the
proof proceeds. Specifically we will go from simple to general cases: (i) the
unconstrained case; (ii) the equality-constrained case; (iii) the
inequality-constrained case; and (iv) the general case (with both equality and
inequality constraints). In this section, we will cover the first and second cases.
In the next section, we will prove the third and fourth cases to complete the
proof. Throughout the proof, we will assume that a minimum is attained, i.e.,
$p^*$ is finite. Otherwise, $p^* = -\infty$ or $+\infty$, which is not a scenario of interest.


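Unconstrained optimization This is a trivial case. In this case, the primal and
dual problems read:

$$p^* := \min_{x} f(x);$$

$$d^* := \max g.$$

With no constraints there are no Lagrange multipliers, so the dual function $g$ is a
constant and $d^*$ is simply $g$. We get:

$$d^* = g \overset{(a)}{=} \min_{x \in \mathcal{X}} L(x) = p^*$$

where (a) is due to the definition of the dual function, and the last equality holds
because $L(x) = f(x)$ when there are no constraints.
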
Equality-constrained optimization Consider:

$$p^* := \min_{x} f(x) : \ Ax - b = 0;$$

$$d^* := \max_{\nu} g(\nu)$$

where $A \in \mathbb{R}^{p \times d}$ and $p < d$. Without loss of generality, assume that $\mathrm{rank}(A) =
\min(p, d) = p$. Why? Otherwise, one can remove dependent rows of $A$ to make
it full rank. Remember that the case $p \ge d$ is not of interest, since then $x^*$ is
solely decided by the equality constraint, having nothing to do with the objective
function. We will prove strong duality by showing the following two:

$$p^* \ge d^*; \tag{2.72}$$

$$p^* \le d^*. \tag{2.73}$$

In fact, (2.72) is what we proved earlier in Section 2.1 in a more general context.
In the sequel, we repeat the proof for those who do not remember the details.


Review of the proof of (2.72): $p^* \ge d^*$ Suppose that a feasible point in the
primal problem, say $x^*$, achieves $p^*$; similarly, another feasible point in the dual
problem, say $\nu^*$, achieves $d^*$. Using the fact that $x^*$ and $\nu^*$ are the minimizer and
maximizer of the primal and dual problems respectively, we get:

$$\begin{aligned}
p^* &= f(x^*) \\
&\overset{(a)}{=} f(x^*) + \nu^{*T} (Ax^* - b) \\
&\ge \min_{x} f(x) + \nu^{*T} (Ax - b) \\
&\overset{(b)}{=} g(\nu^*) \\
&= d^*
\end{aligned}$$

where (a) follows from $Ax^* - b = 0$ for a feasible point $x^*$; and (b) comes from
the definition of the dual function.
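
As a numerical sanity check of the chain of inequalities above, here is a minimal Python sketch on a hypothetical toy instance that is not from the text: $f(x) = x_1^2 + x_2^2$ with a single equality constraint $a^T x - b = 0$, where $a = (1, 2)$ and $b = 2$ are chosen purely for illustration. For this choice the dual function has a closed form, so one can confirm that $g(\nu) \le p^*$ for arbitrary $\nu$ and that the bound is tight at the dual maximizer.

```python
import random

# Hypothetical toy instance (an assumption for illustration, not from the text):
#   minimize f(x) = x1^2 + x2^2  subject to  a^T x - b = 0.
a = (1.0, 2.0)
b = 2.0
a_sq = a[0] ** 2 + a[1] ** 2  # a^T a


def dual_function(nu):
    """g(nu) = min_x f(x) + nu * (a^T x - b); the minimizer is x = -nu * a / 2."""
    x = (-nu * a[0] / 2.0, -nu * a[1] / 2.0)
    return x[0] ** 2 + x[1] ** 2 + nu * (a[0] * x[0] + a[1] * x[1] - b)


# Primal optimum: the point on the line a^T x = b that is closest to the origin.
x_star = (b * a[0] / a_sq, b * a[1] / a_sq)
p_star = x_star[0] ** 2 + x_star[1] ** 2

# Dual optimum: setting dg/dnu = 0 gives nu* = -2 b / (a^T a).
nu_star = -2.0 * b / a_sq
d_star = dual_function(nu_star)

# Weak direction (2.72): every value of the dual function is a lower bound on p*.
random.seed(0)
random_duals = [dual_function(random.uniform(-10.0, 10.0)) for _ in range(1000)]

print("p* =", p_star, " d* =", d_star, " largest sampled g(nu) =", max(random_duals))
```

The random draws only illustrate the direction $p^* \ge d^*$; the fact that the gap closes at $\nu^*$ is specific to this convex instance and anticipates (2.73).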
Proof of (2.73): $p^* \le d^*$ The proof of this is not that straightforward. It relies
upon a trick which is based on a smartly-manipulated set (that you will see
soon), as well as a well-known theorem concerning the role of a hyperplane
(see footnote 2) when there are two disjoint convex sets. As of now, you may have
no idea of what we are talking about. Don't worry. This will become clearer soon.

Let us start by defining the smartly-manipulated set:

$$S := \{(v, t) \in \mathbb{R}^{p+1} : \exists x \text{ such that } f(x) \le t, \ Ax - b = v\}. \tag{2.74}$$


Three key properties of the set S: We point out three key properties of $S$, which
will play a crucial role in proving (2.73). The first is that the set $S$ contains the
optimal value $p^*$ of the primal problem when $v = 0$, i.e., $(0, p^*) \in S$. Moreover,
$p^*$ is the minimum $t$ when $v = 0$, i.e., $(0, p^*)$ is on the boundary of the set.
Why? Suppose that $p^*$ is not the minimum, i.e., $(0, p^*)$ is strictly inside $S$. Then,
there exists some small $\epsilon > 0$ such that another point $(0, p^* - \epsilon)$ is in
$S$, which contradicts the fact that $p^*$ is the optimal value.

The second property is that:

$$(v, t) \in S \implies (v, t') \in S, \quad \forall t' \ge t. \tag{2.75}$$

This is obvious, since $f(x) \le t$ implies that $f(x) \le t'$ for $t' \ge t$. For instance,
$(0, t') \in S$ whenever $t' \ge p^*$. See the blue line in Fig. 2.4 for illustration.

Figure 2.4. The set $S := \{(v, t) \in \mathbb{R}^{p+1} : \exists x \text{ such that } f(x) \le t, \ Ax - b = v\}$ contains the point
$(0, p^*)$ as well as any point $(0, t')$ where $t' \ge p^*$. This figure is a simplified version in which the
dimension of $v$ is 1.

2. For those who do not remember the definition of a hyperplane, we recall it here. A hyperplane is an affine
subspace (a linear subspace, possibly shifted away from the origin) whose dimension is one less than that of
its ambient space.

The last property that we would like to emphasize is that the set $S$ is convex. The
proof of this is straightforward. Suppose $(v_1, t_1), (v_2, t_2) \in S$. Then, this together
with the definition (2.74) of the set $S$ yields: there exist some points, say $x_1$ and $x_2$,
such that

$$f(x_1) \le t_1, \quad Ax_1 - b = v_1;$$

$$f(x_2) \le t_2, \quad Ax_2 - b = v_2.$$

Applying a $\lambda$-weighted convex combination to the above, we get: for $\lambda \in [0, 1]$,

$$\lambda v_1 + (1 - \lambda) v_2 = A(\lambda x_1 + (1 - \lambda) x_2) - b;$$

$$\begin{aligned}
\lambda t_1 + (1 - \lambda) t_2 &\ge \lambda f(x_1) + (1 - \lambda) f(x_2) \\
&\overset{(a)}{\ge} f(\lambda x_1 + (1 - \lambda) x_2)
\end{aligned}$$

where (a) follows from the convexity of $f(x)$. This implies that there exists $x =
\lambda x_1 + (1 - \lambda) x_2$ such that:

$$f(x) \le \lambda t_1 + (1 - \lambda) t_2;$$

$$Ax - b = \lambda v_1 + (1 - \lambda) v_2,$$

which in turn yields that $\lambda (v_1, t_1) + (1 - \lambda)(v_2, t_2) \in S$, thus proving the convexity
of $S$.
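
The first and third properties can be spot-checked numerically. The sketch below is only an illustration under stated assumptions: it reuses a hypothetical toy instance that is not from the text, $f(x) = x_1^2 + x_2^2$ with the single constraint $a^T x - b = 0$, $a = (1, 2)$, $b = 2$. For this instance membership in $S$ has a closed form, because the smallest value of $f$ over $\{x : a^T x - b = v\}$ equals $(b + v)^2 / (a^T a)$.

```python
import random

# Hypothetical toy instance (an assumption for illustration, not from the text).
a = (1.0, 2.0)
b = 2.0
a_sq = a[0] ** 2 + a[1] ** 2


def f(x):
    return x[0] ** 2 + x[1] ** 2


def in_S(v, t, tol=1e-9):
    """(v, t) is in S iff min{f(x) : a^T x - b = v} = (b + v)^2 / (a^T a) is at most t."""
    return (b + v) ** 2 / a_sq <= t + tol


def sample_point_of_S():
    """Pick a random x and return (v, t) with v = a^T x - b and t >= f(x)."""
    x = (random.uniform(-5.0, 5.0), random.uniform(-5.0, 5.0))
    v = a[0] * x[0] + a[1] * x[1] - b
    return v, f(x) + random.uniform(0.0, 1.0)


random.seed(1)
all_inside = True
for _ in range(1000):
    (v1, t1), (v2, t2) = sample_point_of_S(), sample_point_of_S()
    lam = random.random()
    # A convex combination of two points of S should again lie in S.
    all_inside = all_inside and in_S(lam * v1 + (1 - lam) * v2, lam * t1 + (1 - lam) * t2)

p_star = b ** 2 / a_sq
print("convex combinations stay in S:", all_inside)
print("(0, p*) in S:", in_S(0.0, p_star), " (0, p* - 0.1) in S:", in_S(0.0, p_star - 0.1))
```

A finite number of random samples does not prove convexity, of course; the check merely agrees with the argument above, and the last line shows that $(0, p^*)$ sits on the boundary of $S$ for this instance.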

What the set S looks like and the use of the separating hyperplane theorem: Using
the second and third properties mentioned above, one can imagine what the set
looks like. Since it is convex and any point above the boundary is also in the set, the
boundary of the set is bowl-shaped, as illustrated in Fig. 2.5.

Figure 2.5. The boundary (marked by the blue curve) of the convex set $S := \{(v, t) \in \mathbb{R}^{p+1} :
\exists x \text{ such that } f(x) \le t, \ Ax - b = v\}$ is bowl-shaped.

We are now ready to introduce a well-known theorem regarding hyperplanes,
the so-called separating hyperplane theorem. The theorem says: If there are two
disjoint convex sets, then there exists a hyperplane which separates the two
convex sets. Intuitively this makes sense. Why? Think about two disjoint disks in a
2-dimensional space, which are obviously convex. Then, there must be a line
somewhere in between the two disks which separates them. Although the theorem
looks trivial, its proof is non-trivial. Here we will not cover the proof, but you
will have a chance to prove it in Prob 6.1.
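
The two-disk picture can be made concrete with a few lines of Python. This is only a sketch: the centers, radii, and the particular choice of separating line (perpendicular to the segment joining the centers, halfway between the closest boundary points) are assumptions made for illustration, and many other separating lines exist.

```python
import math
import random

# Two hypothetical disjoint disks (centers and radii are illustrative assumptions).
c1, r1 = (0.0, 0.0), 1.0
c2, r2 = (4.0, 1.0), 1.5

dist = math.hypot(c2[0] - c1[0], c2[1] - c1[1])
# Unit normal of the separating line, pointing from disk 1 toward disk 2.
normal = ((c2[0] - c1[0]) / dist, (c2[1] - c1[1]) / dist)


def along(x):
    """Coordinate of the point x along the normal direction."""
    return normal[0] * x[0] + normal[1] * x[1]


# Place the line halfway between the two closest boundary points.
offset = (along(c1) + r1 + along(c2) - r2) / 2.0


def sample_disk(center, radius):
    """Draw a uniformly distributed point from the disk."""
    angle = random.uniform(0.0, 2.0 * math.pi)
    rho = radius * math.sqrt(random.random())
    return (center[0] + rho * math.cos(angle), center[1] + rho * math.sin(angle))


random.seed(2)
side1 = [along(sample_disk(c1, r1)) - offset for _ in range(1000)]
side2 = [along(sample_disk(c2, r2)) - offset for _ in range(1000)]

print("disk 1 on the negative side:", max(side1) < 0)
print("disk 2 on the positive side:", min(side2) > 0)
```

Here the line is $\{x : \text{normal}^T x - \text{offset} = 0\}$; every sampled point of one disk gives a negative value and every sampled point of the other a positive value.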

Now you may wonder why the theorem matters here. The reason is that the theorem
allows us to come up with a hyperplane which passes through the boundary point
$(0, p^*)$ while separating the set $S$ from another disjoint convex set, and this will
help us to prove (2.73) in the end. Why does the theorem ensure the existence of
such a hyperplane? To see this, consider another set, say $S'$, defined as:

$$S' := \{(0, s) \in \mathbb{R}^{p+1} : s < p^*\}. \tag{2.76}$$

Obviously this is convex (as it is just a half-line) and disjoint from $S$: a point
$(0, s) \in S$ with $s < p^*$ would contradict the optimality of $p^*$. Now using the
separating hyperplane theorem, we can say that there exists a hyperplane which
separates $S$ from $S'$ while passing through the boundary point $(0, p^*)$. Let
$(\nu, \mu) \in \mathbb{R}^{p+1}$, with $\nu \in \mathbb{R}^p$ and $\mu \in \mathbb{R}$, be the support vector of the hyperplane,
i.e., the vector perpendicular to every direction within the hyperplane. Then, the
hyperplane is represented as:

$$\begin{bmatrix} \nu \\ \mu \end{bmatrix}^T \left( \begin{bmatrix} v \\ t \end{bmatrix} - \begin{bmatrix} 0 \\ p^* \end{bmatrix} \right) = 0. \tag{2.77}$$

Why? Notice that the support vector determines the slope and this plane passes
through the point $(0, p^*)$. The separating hyperplane theorem says that whenever
$(v, t) \in S$, it always lies on one side of the hyperplane, i.e.,

$$(v, t) \in S \implies \begin{bmatrix} \nu \\ \mu \end{bmatrix}^T \left( \begin{bmatrix} v \\ t \end{bmatrix} - \begin{bmatrix} 0 \\ p^* \end{bmatrix} \right) \ge 0. \tag{2.78}$$

See Fig. 2.6 to help your understanding.

Figure 2.6. There exists a hyperplane that passes through $(0, p^*)$ in the set $S$ while
separating $S$ from another disjoint convex set.

Last step of the proof: Using (2.78), we see:

$$\mu p^* \le \nu^T v + \mu t \quad \forall (v, t) \in S. \tag{2.79}$$

One important thing to notice here is that $\mu \ge 0$. To prove this, suppose $\mu < 0$.
Then, one can let $t \to \infty$. Observe that such $(v, t)$ is still in $S$ due
to the second property (2.75) of $S$. But this yields a contradiction with (2.79), as
its right-hand side diverges:

$$\mu p^* \le \nu^T v + \mu t \to -\infty.$$

Notice that for finite $p^*$ (which we assumed), $\mu p^*$ is also finite, so it cannot be
bounded above by a quantity that tends to $-\infty$. Another thing to notice is that:

$$\mu \neq 0. \tag{2.80}$$

We will prove this soon. This together with $\mu \ge 0$ gives $\mu > 0$. This then enables
us to divide both sides of (2.79) by $\mu > 0$, thus obtaining:

$$p^* \le \left( \frac{\nu}{\mu} \right)^T v + t \quad \forall (v, t) \in S. \tag{2.81}$$

Recall the definition of the set of interest: $S := \{(v, t) \in \mathbb{R}^{p+1} : \exists x \text{ such that }
f(x) \le t, \ Ax - b = v\}$. The fact that $(v, t) \in S$ means that there exists $x$ such that
$f(x) \le t$ and $Ax - b = v$. Applying such $x$'s to (2.81), we obtain:

$$\begin{aligned}
p^* &\le \min_{x : f(x) \le t} \ t + \left( \frac{\nu}{\mu} \right)^T (Ax - b) \\
&\overset{(a)}{=} \min_{x : f(x) \le t} \ f(x) + \left( \frac{\nu}{\mu} \right)^T (Ax - b)
\end{aligned} \tag{2.82}$$

where (a) follows from the fact that minimizing $t$ is equivalent to minimizing $f(x)$.
Notice in the above that the constraint $f(x) \le t$ becomes vacuous as $t \to \infty$.
Hence, taking $t \to \infty$, we get:

$$\begin{aligned}
p^* &\le \min_{x} f(x) + \left( \frac{\nu}{\mu} \right)^T (Ax - b) \\
&\overset{(a)}{=} g\left( \frac{\nu}{\mu} \right) \\
&\overset{(b)}{\le} d^*
\end{aligned}$$

where (a) is due to the definition of the dual function; and (b) comes from the
definition of $d^*$. This completes the proof of (2.73).
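
To see inequality (2.81) at work, the following sketch checks it on sampled points of $S$ for a hypothetical toy instance that is not from the text: $f(x) = x_1^2 + x_2^2$ with the single constraint $a^T x - b = 0$, $a = (1, 2)$, $b = 2$. As an assumption made only for this illustration, the ratio $\nu / \mu$ is taken to be the dual maximizer $\nu^* = -2b / (a^T a)$ of that instance; the proof itself only asserts that some such vector exists.

```python
import random

# Hypothetical toy instance (an assumption for illustration, not from the text).
a = (1.0, 2.0)
b = 2.0
a_sq = a[0] ** 2 + a[1] ** 2

p_star = b ** 2 / a_sq          # optimal value of: min x1^2 + x2^2 s.t. a^T x = b
nu_over_mu = -2.0 * b / a_sq    # assumed slope: the dual maximizer of this instance

random.seed(3)
margins = []
for _ in range(1000):
    # Build a point (v, t) of S from a random x: v = a^T x - b and t >= f(x).
    x = (random.uniform(-5.0, 5.0), random.uniform(-5.0, 5.0))
    v = a[0] * x[0] + a[1] * x[1] - b
    t = x[0] ** 2 + x[1] ** 2 + random.uniform(0.0, 1.0)
    # Inequality (2.81) says this quantity is non-negative on all of S.
    margins.append(nu_over_mu * v + t - p_star)

print("smallest margin over the samples:", min(margins))
```

For this instance one can also verify by hand that the margin on the boundary of $S$ equals $v^2 / (a^T a)$, which is zero only at $v = 0$, i.e., the hyperplane touches $S$ exactly at $(0, p^*)$.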
Proof of (2.80): $\mu \neq 0$ The proof is by contradiction. Suppose $\mu = 0$.
Then, (2.79) implies:

$$\nu^T v \ge 0 \quad \forall (v, t) \in S. \tag{2.83}$$

Two things to note. The first is that $\nu \neq 0$. This is obvious. Otherwise $(\nu, \mu) = 0$,
which does not define a hyperplane at all, so no hyperplane would pass through
$(0, p^*)$ in $S$ while separating $S$ from another disjoint convex set. This violates the
separating hyperplane theorem. The second is that $v$ can be chosen arbitrarily as
long as there exists $x$ such that $Ax - b = v$ and $f(x) \le t$. Due to the second
property of $S$ mentioned earlier, $(v, t') \in S$ for every $t' \ge t$ whenever $(v, t) \in S$, so
$t$ can always be taken large enough. Hence, the only
thing that we need to worry about is whether there exists $x$ satisfying $Ax - b = v$.
Now remember what we assumed in the beginning: $A$ has full rank ($\mathrm{rank}(A) = p$).
This means that we can choose $x$ such that $Ax - b$ points in an arbitrary direction.
Hence, there exists some point, say $x'$, such that $v = Ax' - b$ points
opposite to $\nu$, so that:

$$\nu^T v < 0.$$

This contradicts (2.83), thus completing the proof of (2.80).
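
The role of the full-rank assumption can be illustrated directly. In the sketch below, the matrix $A$, the vector $b$, and the candidate $\nu$ are hypothetical values chosen for illustration. Because $A$ has full row rank, $AA^T$ is invertible, and $x' = A^T (AA^T)^{-1} (b - \nu)$ gives $Ax' - b = -\nu$, which points exactly opposite to $\nu$.

```python
# Hypothetical full-row-rank matrix A (p = 2, d = 3), vector b, and nonzero nu.
A = [[1.0, 2.0, 0.0],
     [0.0, 1.0, -1.0]]
b = [1.0, 2.0]
nu = [0.5, -2.0]


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


# Gram matrix G = A A^T (2 x 2); it is invertible because rank(A) = 2.
G = [[sum(A[i][k] * A[j][k] for k in range(3)) for j in range(2)] for i in range(2)]
det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
G_inv = [[G[1][1] / det, -G[0][1] / det],
         [-G[1][0] / det, G[0][0] / det]]

# Solve A x' = b - nu via x' = A^T G^{-1} (b - nu), so that A x' - b = -nu.
target = [b[i] - nu[i] for i in range(2)]
w = matvec(G_inv, target)
x_prime = [sum(A[i][k] * w[i] for i in range(2)) for k in range(3)]

v = [r - bi for r, bi in zip(matvec(A, x_prime), b)]
inner = sum(n * vi for n, vi in zip(nu, v))

print("v = A x' - b =", v)
print("nu^T v =", inner)
```

The inner product equals $-\|\nu\|^2$, which is strictly negative for any $\nu \neq 0$; if $A$ were rank-deficient, the Gram matrix would be singular and this construction would not be available.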


Look ahead It turns out that using the techniques employed so far, one can prove
$p^* \le d^*$ for the inequality-constrained case and the general case. We will cover these
two remaining cases in the next section to complete the proof.


2.4 Proof of Strong Duality Theorem (2/2)


Recap In the last section, we proved the strong duality theorem for two simple
cases: (i) unconstrained optimization; and (ii) equality-constrained optimization.
The key trick was to introduce the following set:

$$S := \{(v, t) \in \mathbb{R}^{p+1} : \exists x \text{ such that } Ax - b = v, \ f(x) \le t\} \tag{2.84}$$

where $v \in \mathbb{R}^p$ and $t \in \mathbb{R}$. By using the key properties of $S$ (e.g., the convexity of
$S$ and $(0, p^*) \in S$) together with the separating hyperplane theorem, we proved
$p^* \le d^*$ where $p^*$ and $d^*$ are the optimal values of the primal and dual problems
respectively. Combining this with the already-proven fact $p^* \ge d^*$, we completed
the proof.


Outline In this section, we are going to cover the two remaining cases
(inequality-constrained and general optimization) to complete the proof. Again we
assume that $p^*$ is finite. In this more general setting that includes inequality
constraints, the statement of the strong duality theorem should be made carefully.
The precise statement reads: $p^* = d^*$ holds under a mild condition:

$$\exists x \text{ such that strict inequalities hold, i.e., } f_1(x) < 0, \dots, f_m(x) < 0 \tag{2.85}$$

where the $f_i(x)$'s indicate the left-hand-side functions in the standard form of the
inequality constraints. This condition is called Slater's condition, since it was found
by the mathematician Morton L. Slater (Slater, 2014). It serves as a sufficient
condition for strong duality to hold. It is considered to be mild, as the condition
often holds in practice.


Inequality-constrained optimization For illustrative purposes, we first
consider a simple case having only one inequality constraint:

$$p^* := \min_{x} f(x) : \ f_1(x) \le 0;$$

$$d^* := \max_{\lambda \ge 0} g(\lambda).$$

It turns out one can readily extend this to the general case. So let us focus on this
simple setting for now. Like the equality-constrained case, one can easily show that
$p^* \ge d^*$. The proof is almost the same. Please check this by yourself. So it suffices to
prove the other direction:

$$p^* \le d^*. \tag{2.86}$$

Proof of (2.86): $p^* \le d^*$ Like the equality-constrained case, we introduce a
smartly-manipulated set that plays an important role in the proof. The set is defined
similarly to (2.84):

$$S = \{(u, t) \in \mathbb{R}^2 : \exists x \text{ such that } f_1(x) \le u, \ f(x) \le t\} \tag{2.87}$$

where $u \in \mathbb{R}$. One distinction is that we now have $u$ (instead of $v$), which acts as
an upper bound on $f_1(x)$. Similar to (2.84), the newly-defined set $S$ in (2.87) also
exhibits three properties:

(i) It contains the boundary point $(0, p^*)$;
(ii) $(u, t) \in S \implies (u', t') \in S$ whenever $u' \ge u$ and $t' \ge t$;
(iii) $S$ is convex.

The first and second are obvious. The third property makes intuitive sense if
we think about a picture, as illustrated in Fig. 2.7. Consider a case in which $u >
0$. This is a more relaxed scenario relative to $u = 0$, because the
constraint $f_1(x) \le u$ has a larger search space compared to $f_1(x) \le 0$. Hence, the
minimum $t$ in $S$ for this $u$ would be smaller than or equal to $p^*$. On the other hand, when
$u < 0$, the constraint $f_1(x) \le u$ yields a smaller search space; and therefore the
minimum $t$ in $S$ would be larger than or equal to $p^*$. With this argument, one can
imagine a shape of the set $S$ like the one in Fig. 2.8 (the cyan-colored region). So one
can conjecture that the set $S$ is convex. It turns out this is indeed the case. The proof
is almost the same as in the equality-constrained case.

Proof of the convexity of S: Suppose $(u_1, t_1), (u_2, t_2) \in S$. Then, this together
with the definition (2.87) of the set $S$ yields: there exist some points, say $x_1$ and $x_2$,
such that

$$f_1(x_1) \le u_1, \quad f(x_1) \le t_1;$$

$$f_1(x_2) \le u_2, \quad f(x_2) \le t_2.$$

Applying a $\lambda$-weighted convex combination to the above, we get: for $\lambda \in [0, 1]$,

$$\begin{aligned}
\lambda u_1 + (1 - \lambda) u_2 &\ge \lambda f_1(x_1) + (1 - \lambda) f_1(x_2) \\
&\overset{(a)}{\ge} f_1(\lambda x_1 + (1 - \lambda) x_2)
\end{aligned}$$

where (a) follows from the convexity of $f_1(x)$. Similarly, using the convexity of $f(x)$,
we get:

$$\lambda t_1 + (1 - \lambda) t_2 \ge f(\lambda x_1 + (1 - \lambda) x_2).$$

Hence, $\lambda (u_1, t_1) + (1 - \lambda)(u_2, t_2) \in S$. This completes the proof. ∎

Figure 2.7. The set $S = \{(u, t) \in \mathbb{R}^2 : \exists x \text{ such that } f_1(x) \le u, \ f(x) \le t\}$ contains the point $(0, p^*)$ as
well as any point $(0, t')$ where $t' \ge p^*$. Also for $u > 0$, the minimum $t$ in the set $S$ is smaller
than or equal to $p^*$; when $u < 0$, the minimum $t$ is larger than or equal to $p^*$.

Figure 2.8. There exists a hyperplane (a line in this example) that passes through $(0, p^*)$
in the set $S = \{(u, t) \in \mathbb{R}^2 : \exists x \text{ such that } f_1(x) \le u, \ f(x) \le t\}$.


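Now using the separating hyperplane theorem, we argue that there exists
$(\lambda, \mu) \in \mathbb{R}^2$ with $(\lambda, \mu) \neq 0$ such that

$$(u, t) \in S \implies \begin{bmatrix} \lambda \\ \mu \end{bmatrix}^T \left( \begin{bmatrix} u \\ t \end{bmatrix} - \begin{bmatrix} 0 \\ p^* \end{bmatrix} \right) \ge 0. \tag{2.88}$$

Looking at the hyperplane in Fig. 2.8, we see that the support vector w.r.t. the
hyperplane has a positive direction. Hence, one may conjecture that

$$\lambda \ge 0, \quad \mu \ge 0. \tag{2.89}$$

It turns out this is indeed the case.

Proof of (2.89): Notice that (2.88) gives:

$$\mu p^* \le \lambda u + \mu t \quad \forall (u, t) \in S. \tag{2.90}$$

Let us first prove $\mu \ge 0$. Suppose $\mu < 0$. We then let $t \to \infty$. Here such
$(u, t)$ is still in $S$ due to the second property of $S$. But this yields a contradiction,
as the right-hand side diverges:

$$\mu p^* \le \lambda u + \mu t \to -\infty. \tag{2.91}$$

Similarly, by letting $u \to \infty$, one can prove $\lambda \ge 0$. ∎

Let us go back to the main stream of the proof. Like the equality-constrained
case, it turns out:

$$\mu \neq 0 \tag{2.92}$$

where the proof will be given soon. This together with (2.89) gives $\mu > 0$. Dividing
both sides of (2.90) by $\mu > 0$, we obtain:

$$p^* \le t + \frac{\lambda}{\mu} u \quad \forall (u, t) \in S. \tag{2.93}$$

Recall the definition of the set of interest:

$$S := \{(u, t) \in \mathbb{R}^2 : \exists x \text{ such that } f_1(x) \le u, \ f(x) \le t\}. \tag{2.94}$$

The fact $(u, t) \in S$ means that there exists $x$ such that $f_1(x) \le u$ and $f(x) \le t$. By
applying such $x$'s to (2.93), we obtain:

$$\begin{aligned}
p^* &\le \min_{x : f_1(x) \le u, \ f(x) \le t} \ t + \frac{\lambda}{\mu} u \\
&\overset{(a)}{=} \min_{x : f_1(x) \le u, \ f(x) \le t} \ f(x) + \frac{\lambda}{\mu} f_1(x)
\end{aligned} \tag{2.95}$$

where (a) follows from the fact that, since $\lambda / \mu \ge 0$, minimizing $t$ and $u$ is equivalent
to minimizing $f(x)$ and $f_1(x)$, respectively. Notice in the above that the constraints
$f_1(x) \le u$ and $f(x) \le t$ become vacuous as $(u, t) \to (\infty, \infty)$. Hence, we get:

$$\begin{aligned}
p^* &\le \min_{x} f(x) + \frac{\lambda}{\mu} f_1(x) \\
&= g\left( \frac{\lambda}{\mu} \right) \\
&\le d^*.
\end{aligned}$$

This completes the proof.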
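
A one-dimensional example makes the picture in Fig. 2.8 tangible. The instance below is hypothetical and not from the text: $f(x) = x^2$ with the single constraint $f_1(x) = 1 - x \le 0$. For it, the lower boundary of $S$ and the dual function are available in closed form, so the sketch can check both $p^* = d^*$ and inequality (2.93), taking the slope $\lambda / \mu$ to be the dual maximizer (an assumption made for this illustration).

```python
def perturbed_optimum(u):
    """Smallest t with (u, t) in S: min x^2 subject to 1 - x <= u, i.e., x >= 1 - u."""
    return max(1.0 - u, 0.0) ** 2


def dual_function(lam):
    """g(lam) = min_x x^2 + lam * (1 - x); the minimizer is x = lam / 2."""
    return lam - lam ** 2 / 4.0


p_star = perturbed_optimum(0.0)

# Maximize the dual function over a grid of non-negative multipliers.
lam_grid = [k / 100.0 for k in range(0, 401)]
d_star, lam_star = max((dual_function(lam), lam) for lam in lam_grid)

# Check (2.93) along the lower boundary of S with slope lambda/mu = lam_star.
u_grid = [k / 100.0 for k in range(-300, 301)]
margins = [perturbed_optimum(u) + lam_star * u - p_star for u in u_grid]

print("p* =", p_star, " d* =", d_star, " maximizing lambda =", lam_star)
print("smallest margin in (2.93):", min(margins))
```

The boundary $t = \max(1 - u, 0)^2$ is decreasing and convex in $u$, matching the shape argued for in the text, and the line with slope $-\lambda / \mu$ through $(0, p^*)$ stays below it.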


Proof of (2.92): $\mu \neq 0$ The proof is by contradiction. Suppose $\mu = 0$.
Then, (2.90) implies that:

$$\lambda u \ge 0 \quad \forall (u, t) \in S. \tag{2.96}$$

Here one thing to notice is that $\lambda$ (which was claimed to be non-negative in (2.89))
is strictly positive:

$$\lambda > 0. \tag{2.97}$$

Why? Otherwise, $(\lambda, \mu) = 0$ and this contradicts the separating hyperplane
theorem. Next, consider $f_1(x)$. We have not used the mild condition thus
far: $\exists x$ such that $f_1(x) < 0$. This is where the mild condition comes in. Due to the
condition, there exists a point, say $\tilde{x}$, such that:

$$f_1(\tilde{x}) < 0. \tag{2.98}$$

Given $\tilde{x}$, pick $(u, t)$ such that $f_1(\tilde{x}) \le u < 0$ and $f(\tilde{x}) \le t$. Applying this to the
definition (2.87) of the set $S$, we see that $(u, t) \in S$. But for such $u$,

$$\lambda u < 0 \tag{2.99}$$

due to (2.97) and $u < 0$. This contradicts (2.96), thus completing the proof.

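
To see why the mild condition is really needed in this step, consider a hypothetical instance (not from the text) in which Slater's condition fails: minimize $f(x) = x$ subject to $f_1(x) = x^2 \le 0$. No $x$ satisfies $x^2 < 0$, the only feasible point is $x = 0$, and $p^* = 0$. Here $S = \{(u, t) : u \ge 0, \ t \ge -\sqrt{u}\}$, and the sketch below checks on a grid whether a candidate $(\lambda, \mu)$ satisfies (2.90) along the lower boundary of $S$.

```python
import math

# Hypothetical instance where Slater's condition fails (an illustrative assumption):
#   minimize x  subject to  x^2 <= 0,  so p* = 0 and S = {(u, t): u >= 0, t >= -sqrt(u)}.
p_star = 0.0
u_grid = [k * 1e-5 for k in range(0, 100001)]


def stays_on_one_side(lam, mu):
    """Check mu * p* <= lam * u + mu * t along the lower boundary t = -sqrt(u) of S."""
    return all(lam * u - mu * math.sqrt(u) >= mu * p_star - 1e-12 for u in u_grid)


# The vertical hyperplane (mu = 0) supports S at (0, p*) ...
vertical_ok = stays_on_one_side(1.0, 0.0)

# ... while tilted candidates with mu > 0 cut into S near u = 0.
tilted_ok = [stays_on_one_side(lam, 1.0) for lam in (1.0, 10.0, 100.0)]

print("mu = 0 works:", vertical_ok)
print("mu = 1 with lambda in (1, 10, 100) works:", tilted_ok)
```

In this instance the boundary of $S$ has a vertical tangent at $(0, p^*)$, so only hyperplanes with $\mu = 0$ pass the check and the division by $\mu$ used in (2.93) is unavailable. The three tilted candidates are just samples; the same failure occurs for every $\lambda$ because $\lambda u < \sqrt{u}$ for small enough $u > 0$.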
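Multiple-inequality-constrained optimization Next consider a
multiple-inequality-constrained convex optimization:

$$p^* := \min_{x} f(x) : \ f_i(x) \le 0, \ i \in \{1, \dots, m\};$$

$$d^* := \max_{\lambda \ge 0} g(\lambda).$$

The proof of $p^* \le d^*$ is almost the same as in the single-inequality case. The only
distinction here lies in the definition of the smartly-manipulated set:

$$S = \{(u, t) \in \mathbb{R}^{m+1} : \exists x \text{ such that } f_i(x) \le u_i, \ i \in \{1, \dots, m\}, \text{ and } f(x) \le t\} \tag{2.100}$$

where $u \in \mathbb{R}^m$. Like the single-inequality case, one can readily show the following
to prove $p^* \le d^*$:

1. $S$ is convex;

2. There exists a hyperplane $(\lambda, \mu) \in \mathbb{R}^{m+1}$ with $(\lambda, \mu) \neq 0$ such that

$$(u, t) \in S \implies \begin{bmatrix} \lambda \\ \mu \end{bmatrix}^T \left( \begin{bmatrix} u \\ t \end{bmatrix} - \begin{bmatrix} 0 \\ p^* \end{bmatrix} \right) \ge 0; \tag{2.101}$$

3. $\lambda \ge 0$, $\mu > 0$;

4. $p^* \le t + \left( \frac{\lambda}{\mu} \right)^T u \quad \forall (u, t) \in S$;

5. $p^* \le \min_{x} f(x) + \sum_{i=1}^{m} \frac{\lambda_i}{\mu} f_i(x) = g\left( \frac{\lambda}{\mu} \right) \le d^*$.

General optimization Lastly consider a general convex optimization problem
which has both equality and inequality constraints:

$$p^* := \min_{x} f(x) : \ f_i(x) \le 0, \ i = 1, \dots, m, \quad Ax - b = 0;$$

where $A \in \mathbb{R}^{p \times d}$, $p < d$ and $\mathrm{rank}(A) = p$. Again the key distinction is in the
definition of the smartly-manipulated set:

$$S = \{(u, v, t) \in \mathbb{R}^{m+p+1} : \exists x \text{ s.t. } f_i(x) \le u_i, \ i \in \{1, \dots, m\}, \ Ax - b = v, \ f(x) \le t\} \tag{2.102}$$

where $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^p$. The procedure may look complicated,
but it is simply a combination of the ideas that we studied
while tackling the simpler cases above. Concretely, one can show the following to
prove $p^* \le d^*$:

1. $S$ is convex;

2. There exists a hyperplane $(\lambda, \nu, \mu) \in \mathbb{R}^{m+p+1}$ with $(\lambda, \nu, \mu) \neq 0$ such that

$$(u, v, t) \in S \implies \begin{bmatrix} \lambda \\ \nu \\ \mu \end{bmatrix}^T \left( \begin{bmatrix} u \\ v \\ t \end{bmatrix} - \begin{bmatrix} 0 \\ 0 \\ p^* \end{bmatrix} \right) \ge 0; \tag{2.103}$$

3. $\lambda \ge 0$, $\mu > 0$;

4. $p^* \le t + \left( \frac{\lambda}{\mu} \right)^T u + \left( \frac{\nu}{\mu} \right)^T v \quad \forall (u, v, t) \in S$;

5. $p^* \le \min_{x} f(x) + \sum_{i=1}^{m} \frac{\lambda_i}{\mu} f_i(x) + \left( \frac{\nu}{\mu} \right)^T (Ax - b) = g\left( \frac{\lambda}{\mu}, \frac{\nu}{\mu} \right) \le d^*$.

You will also have a chance to dig into the details of the proof in Prob 6.3.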
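
As a small check of the general statement, the sketch below evaluates the dual function of a hypothetical problem (not from the text) with one inequality and one equality constraint: minimize $x_1^2 + x_2^2$ subject to $1.5 - x_1 \le 0$ and $x_1 + x_2 - 2 = 0$. The primal optimum $x^* = (1.5, 0.5)$ can be found by hand, and Slater's condition holds (e.g., at $x = (1.8, 0.2)$). The grid ranges for the multipliers are arbitrary choices for illustration.

```python
def dual_function(lam, nu):
    """g(lam, nu) = min_x x1^2 + x2^2 + lam * (1.5 - x1) + nu * (x1 + x2 - 2).

    Setting the gradient w.r.t. x to zero gives x1 = (lam - nu) / 2 and x2 = -nu / 2.
    """
    x1, x2 = (lam - nu) / 2.0, -nu / 2.0
    return x1 ** 2 + x2 ** 2 + lam * (1.5 - x1) + nu * (x1 + x2 - 2.0)


# Primal optimum found by hand: the inequality is active, so x* = (1.5, 0.5).
p_star = 1.5 ** 2 + 0.5 ** 2

# Maximize g over a grid with lam >= 0 (dual feasibility) and nu unrestricted.
lam_grid = [k / 20.0 for k in range(0, 81)]          # 0.0, 0.05, ..., 4.0
nu_grid = [-3.0 + k / 20.0 for k in range(0, 81)]    # -3.0, -2.95, ..., 1.0
d_star, lam_best, nu_best = max(
    (dual_function(lam, nu), lam, nu) for lam in lam_grid for nu in nu_grid
)

print("p* =", p_star)
print("d* =", d_star, "attained at lambda =", lam_best, "and nu =", nu_best)
```

On this instance the grid maximum of the dual function coincides with $p^*$, and no grid point exceeds it, in line with $p^* = d^*$ under Slater's condition.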


Look ahead In Part I, we studied a variety of convex optimization problems: all
the problems in Fig. 2.9. However, we did not learn how to design generic
algorithms that can be applied to arbitrary scenarios. To this end, during the past
sections, we learned about strong duality and KKT conditions. We then studied a
generic algorithm building upon them: the interior point method. And in this and
the last section, we proved the strong duality theorem whose proof we had deferred.
So we are essentially done with the convex optimization story.

Figure 2.9. Convex optimization instances that we explored in Part I: Least-squares (1800s),
Linear Program (LP, 1939), Quadratic Program (QP, 1956), Second-Order Cone Program
(SOCP, 1994), and Semi-definite Program (SDP, 1994).


What is next? We may want to ask some interesting questions that can spark
future studies. One very natural follow-up question is: What about non-convex
optimization? Can the techniques that we have learned w.r.t. convex optimization
problems help address the general case? Fortunately, this is indeed the case. It turns
out those techniques can help approximate optimal solutions of general problems.
To understand what this means, we need to study another important theory,
so-called weak duality, which forms the content of the next section.


Problem Set 6


Prob 6.1 (Separating hyperplane theorem) Suppose that $S$ and $S'$ are two
non-empty convex sets which do not intersect, i.e., $S \cap S' = \emptyset$. The
separating hyperplane theorem that we mentioned in Section 2.3 says: There exist
$a \neq 0 \in \mathbb{R}^d$ and $b \in \mathbb{R}$ such that

$$\begin{aligned}
x \in S &\implies a^T x - b \ge 0; \\
x \in S' &\implies a^T x - b \le 0.
\end{aligned} \tag{2.104}$$

This problem explores the proof of this theorem. Define the Euclidean distance
between the two sets $S$ and $S'$ as:

$$\|S - S'\| := \min\{\|s - s'\| : s \in S, \ s' \in S'\}. \tag{2.105}$$

(a) State the definitions of the closed set, the open set, and the bounded set.

(b) Suppose:

$$S \text{ and } S' \text{ are closed and one set, say } S, \text{ is bounded.} \tag{2.106}$$

Show that $\|S - S'\|$ is positive and there exist points $s \in S$ and $s' \in S'$ that
attain the minimum in (2.105), i.e., $\|s - s'\| = \|S - S'\|$.

(c) Suppose (2.106) holds. Prove the separating hyperplane theorem.

(d) Consider the set $D := \{s - s' : s \in S, \ s' \in S'\}$. Show that $D$ is convex and
does not contain the origin.

(e) Suppose (2.106) does not necessarily hold. Prove the separating hyperplane
theorem.

Prob 6.2 (Proof of the strong duality theorem: Exercise 1) Consider a
convex optimization problem:

$$p^* = \min_{x \in \mathbb{R}^d} f(x) : \ f_1(x) \le 0, \ a^T x - b = 0$$

where $a \in \mathbb{R}^d$ and $b \in \mathbb{R}$. The dual problem is then:

$$d^* := \max_{\lambda, \nu} g(\lambda, \nu) : \ \lambda \ge 0$$

where $g(\lambda, \nu)$ indicates the dual function. Define a set:

$$S = \{(u, v, t) \in \mathbb{R}^3 : \exists x \text{ s.t. } f_1(x) \le u, \ a^T x - b = v, \ f(x) \le t\}.$$

Assume that $p^*$ is finite and there exists $\tilde{x} \in \mathbb{R}^d$ such that

$$f_1(\tilde{x}) < 0, \quad a^T \tilde{x} - b = 0. \tag{2.107}$$

(a) Prove that $p^* \ge d^*$.
(b) Prove that $S$ is convex.
(c) State the separating hyperplane theorem. Use the theorem to prove that
there exists $(\lambda, \nu, \mu) \neq 0$ such that

$$\mu p^* \le \lambda u + \nu v + \mu t \quad \forall (u, v, t) \in S. \tag{2.108}$$
(d) For $(\lambda, \mu)$ in part (c), show that

$$\lambda \ge 0, \quad \mu > 0. \tag{2.109}$$


(e) Prove that $p^* \le d^*$.


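Prob 6.3 (Proof of the strong duality theorem: Exercise 2) Consider a
convex optimization problem:

$$p^* := \min_{x \in \mathbb{R}^d} f(x) : \ f_i(x) \le 0, \ i \in \{1, \dots, m\}; \quad Ax - b = 0$$

where $A \in \mathbb{R}^{p \times d}$, $b \in \mathbb{R}^p$, $p < d$ and $\mathrm{rank}(A) = p$. The dual problem is then:

$$d^* := \max_{\lambda, \nu} g(\lambda, \nu) : \ \lambda \ge 0$$

where $g(\lambda, \nu)$ indicates the dual function. Define a set:

$$S = \{(u, v, t) \in \mathbb{R}^{m+p+1} : \exists x \text{ s.t. } f_i(x) \le u_i \ \forall i, \ Ax - b = v, \ f(x) \le t\}$$

where $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^p$. Assume that $p^*$ is finite and Slater's condition is satisfied,
i.e., there exists $\tilde{x} \in \mathbb{R}^d$ such that

$$f_1(\tilde{x}) < 0, \dots, f_m(\tilde{x}) < 0, \quad A\tilde{x} - b = 0. \tag{2.110}$$

(a) Prove that $p^* \ge d^*$.
(b) Prove that $S$ is convex.

(c) Using the separating hyperplane theorem, show that there exists $(\lambda, \nu, \mu) \neq 0$
such that

$$\mu p^* \le \lambda^T u + \nu^T v + \mu t \quad \forall (u, v, t) \in S. \tag{2.111}$$

(d) For $\mu$ in part (c), show that $\mu > 0$.
(e) Prove that $p^* \le d^*$.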
Prob 6.4 (Trace tricks) Consider a square matrix $A \in \mathbb{R}^{d \times d}$. Denote the trace
of a square matrix by:

$$\mathrm{trace}(A) := \sum_{i=1}^{d} A_{ii} \tag{2.112}$$

where $A_{ii}$ indicates the $i$th diagonal entry of $A$.

(a) Suppose $A \succeq 0$. Show that $A$ can be represented as

$$A = XX^T \tag{2.113}$$

for some $X \in \mathbb{R}^{d \times d}$.
(b) Show that $AA^T \succeq 0$.
(c) Consider $A, B \in \mathbb{R}^{d \times d}$. Suppose $A \succeq 0$ and $B \succeq 0$. Show that

$$\mathrm{trace}(AB) \ge 0. \tag{2.114}$$

(d) Let $X, Z \in \mathbb{R}^{d \times d}$ be symmetric matrices. Show that

$$\sum_{i=1}^{d} \sum_{j=1}^{d} Z_{ij} X_{ij} = \mathrm{trace}(ZX). \tag{2.115}$$

Prob 6.5 (KKT conditions for SDP) Consider an optimization problem:

$$p^* := \min_{X} \mathrm{trace}(WX) : \quad X \succeq 0, \quad X_{ii} = 1, \ i \in \{1, \dots, d\} \tag{2.116}$$

where $W \in \mathbb{R}^{d \times d}$, $X \in \mathbb{R}^{d \times d}$ and $X_{ii}$ indicates the $i$th diagonal entry of $X$. Let
$Z \in \mathbb{R}^{d \times d}$ be symmetric and $\nu \in \mathbb{R}^d$. Define the Lagrange function as:

$$L(X, Z, \nu) = \mathrm{trace}(WX) + \sum_{i=1}^{d} \sum_{j=1}^{d} Z_{ij} (-X_{ij}) + \sum_{i=1}^{d} \nu_i (1 - X_{ii}). \tag{2.117}$$

(a) Is the problem (2.116) convex? If so, specify the class of the problem.
(b) Show that

$$L(X, Z, \nu) = \mathrm{trace}((W - Z - D_\nu) X) + \nu^T \mathbf{1} \tag{2.118}$$

where $D_\nu := \mathrm{diag}(\nu_1, \dots, \nu_d)$.
(c) Derive the dual function. Also state the dual problem with a constraint
$Z \succeq 0$.
(d) Let $d^*$ be the optimal value of the dual problem that you formulated in part
(c). Derive necessary conditions for strong duality to hold (i.e., $p^* = d^*$)
in this problem context.

(e) Are the necessary conditions also sufficient for $p^* = d^*$? If so, prove the
sufficiency. Otherwise, are there any further necessary conditions that are
also sufficient for $p^* = d^*$?


Prob 6.6 (Strong duality theorem for SDP) Consider the same optimization
problem (2.116) as in Prob 6.5:

$$p^* := \min_{X} \mathrm{trace}(WX) : \quad X \succeq 0, \quad X_{ii} = 1, \ i \in \{1, \dots, d\} \tag{2.119}$$

where $W \in \mathbb{R}^{d \times d}$, $X \in \mathbb{R}^{d \times d}$ and $X_{ii}$ indicates the $i$th diagonal entry of $X$.
The dual problem is then:

$$d^* := \max_{Z, \nu} g(Z, \nu) : \ Z \succeq 0$$

where $g(Z, \nu)$ indicates the dual function, $Z \in \mathbb{R}^{d \times d}$ is a Lagrange multiplier for
the inequality constraint, and $\nu \in \mathbb{R}^d$ is the equality constraint counterpart.
Define a set:

$$S = \{(u, v, t) \in \mathbb{R}^{m+p+1} : \exists x \text{ s.t. } f_i(x) \le u_i \ \forall i, \ Ax - b = v, \ f(x) \le t\}.$$

Assume that $p^*$ is finite and there exists $\tilde{x} \in \mathbb{R}^d$ such that

$$f_1(\tilde{x}) < 0, \dots, f_m(\tilde{x}) < 0, \quad A\tilde{x} - b = 0. \tag{2.120}$$

(a) Prove that $p^* \ge d^*$.

(b) Prove that $S$ is convex.

(c) Using the separating hyperplane theorem, show that there exists $(\lambda, \nu, \mu) \neq 0$
such that

$$\mu p^* \le \lambda^T u + \nu^T v + \mu t \quad \forall (u, v, t) \in S. \tag{2.121}$$

(d) For $\mu$ in part (c), show that $\mu > 0$.
(e) Prove that $p^* \le d^*$.


Prob 6.7 (True or False?)

(a) Consider a convex optimization problem:

$$\min_{x} f(x) : \ f_1(x) \le 0 \tag{2.122}$$

where $f_1(x)$ is a convex function. Let

$$S = \{(u, t) \in \mathbb{R}^2 : \exists x \text{ such that } f_1(x) \le u, \ f(x) \le t\}. \tag{2.123}$$

Then, there exists $(0, t) \in S$.

(b) Suppose that $S$ and $S'$ are two non-empty convex sets which do not
intersect, i.e., $S \cap S' = \emptyset$. Then, there exists a unique pair of $a \neq 0 \in \mathbb{R}^d$
and $b \in \mathbb{R}$ such that

$$\begin{aligned}
x \in S &\implies a^T x - b \ge 0; \\
x \in S' &\implies a^T x - b \le 0.
\end{aligned} \tag{2.124}$$

(c) Suppose that $S$ and $S'$ are two non-empty convex sets which do
not intersect, i.e., $S \cap S' = \emptyset$. Suppose there exists a unique pair of
$a \neq 0 \in \mathbb{R}^d$ and $b \in \mathbb{R}$ such that

$$\begin{aligned}
x \in S &\implies a^T x - b \ge 0; \\
x \in S' &\implies a^T x - b \le 0.
\end{aligned} \tag{2.125}$$

Then, at least one among $S$ and $S'$ must be open.


2.5 Weak Duality


Recap In Part I, we investigated many instances of convex optimization
problems, ranging from LP, Least Squares, QP, SOCP and all the way up to SDP. See
Fig. 2.10. But in Part I, the algorithm part was not complete. We studied only two
algorithms: the simplex algorithm and gradient descent. These can be applied only
to specific problem settings: LP and some classes of QP. In other words, we did not
learn how to design generic algorithms intended for general convex optimization.
To this end, during the past sections, we learned about strong duality and KKT
conditions. We then studied a generic algorithm inspired by them, namely the
interior point method. Over the last two sections, we proved the strong duality
theorem which forms the basis of the interior point method. With all of these, we
could finally end the convex optimization story.

A natural follow-up question is: What if the optimization
problems of interest are non-convex? Can the techniques that we have studied so
far help us say something about non-convex optimization? It turns out the answer
is yes. Interestingly, it has been shown that the techniques can serve to approximate
optimal solutions of such non-convex problems. To understand
what this means, we need to study another important theory: weak duality.


Outline In this section, we are going to explore weak duality in depth. Specifically
what we are going to do is threefold. First of all, we will figure out what weak
duality means. Like strong duality, there is a relevant important theorem, named
the weak duality theorem. So in the first part, we will prove the theorem. Next we will
investigate how weak duality can serve the claimed role: helping to approximate
non-convex optimization problems. Lastly we will discuss how good the approximated
solution is. To this end, we will quantify the gap to the optimality of the original
optimization.


Convex Optimization 


Least-squares Linear Program 
1800s (LP) 1939 


Quadratic Program (QP) 1956 
Second-Order Cone Program 
(SOCP) 1956 


Semi-definite Program 
(SDP) 1994 


Figure 2.10. Convex optimization problems that we explored in Part |. 
Primal & dual problems Like strong duality, weak duality is stated in the
context of primal and dual problems. To define a primal problem, let us first
recall the standard form of general optimization problems that we introduced in
Section 1.2:

$$
\min_{x \in \mathbb{R}^d} \; f(x) : \quad f_i(x) \le 0, \; i \in \{1, \dots, m\}; \quad h_i(x) = 0, \; i \in \{1, \dots, p\} \tag{2.126}
$$

where $(f(x), f_i(x), h_i(x))$ are arbitrary scalar functions ($\mathbb{R}^d \to \mathbb{R}$),
not necessarily convex or affine. Since we start with the above problem, we call
it the primal problem.

In order to figure out the dual problem, we need to come up with the Lagrange 
and dual functions. First consider the Lagrange function: 


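$$
L(x, \lambda, \nu) = f(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x) \tag{2.127}
$$

where $\lambda = [\lambda_1, \dots, \lambda_m]^T \in \mathbb{R}^m$ and
$\nu = [\nu_1, \dots, \nu_p]^T \in \mathbb{R}^p$ are Lagrange multipliers. The
dual function is then defined as: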


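$$
g(\lambda, \nu) := \min_{x \in \mathcal{X}} L(x, \lambda, \nu). \tag{2.128}
$$

Here $\mathcal{X}$ indicates the entire space that the optimization variable $x$ lies in:

$$
\mathcal{X} := \mathrm{dom} f \cap \mathrm{dom} f_1 \cap \cdots \cap \mathrm{dom} f_m \cap \mathrm{dom} h_1 \cap \cdots \cap \mathrm{dom} h_p.
$$

Using this dual function, we can formulate the dual problem as:

$$
\text{(Dual problem):} \quad \max_{\lambda, \nu} \; g(\lambda, \nu) : \quad \lambda \ge 0. \tag{2.129}
$$
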
What does weak duality mean? Like strong duality, weak duality can be stated
using the optimal values $(p^*, d^*)$ of the primal and dual problems:

$$
\begin{aligned}
\text{(Primal):} \quad p^* &:= \min_{x} f(x) : \; f_i(x) \le 0, \; 1 \le i \le m, \quad h_i(x) = 0, \; 1 \le i \le p; \\
\text{(Dual):} \quad d^* &:= \max_{\lambda, \nu} g(\lambda, \nu) : \; \lambda \ge 0.
\end{aligned}
$$

Weak duality means that the optimal values of the two problems satisfy:

$$
\text{(Weak duality):} \quad p^* \ge d^*. \tag{2.130}
$$


As you may notice, we have already seen weak duality (2.130) before. Actually we
saw it twice: once in Section 2.1, and again in Section 2.3. But we proved it only
in a specific context: convex optimization. The most critical point that we would
like to emphasize here is that weak duality (2.130) holds for any optimization
problem, including non-convex ones, i.e., no matter what the function types of
$(f(x), f_i(x), h_i(x))$ are. We call this the weak duality theorem.


Proof of the weak duality theorem (2.130): $p^* \ge d^*$ The proof is almost
the same as before. It does not rely on the convexity condition, so it carries
over to non-convex optimization as well. Let us verify that this is indeed
the case.

Suppose that a feasible point in the primal problem, say $x^*$, achieves $p^*$;
similarly, another feasible point in the dual problem, say $(\lambda^*, \nu^*)$,
achieves $d^*$. Using the fact that $x^*$ and $(\lambda^*, \nu^*)$ are the
minimizer and maximizer of the primal and dual problems respectively, we get:

$$
\begin{aligned}
p^* &= f(x^*) \\
&\overset{(a)}{\ge} f(x^*) + \sum_{i=1}^{m} \lambda_i^* f_i(x^*) + \sum_{i=1}^{p} \nu_i^* h_i(x^*) \\
&\ge \min_{x \in \mathcal{X}} \; f(x) + \sum_{i=1}^{m} \lambda_i^* f_i(x) + \sum_{i=1}^{p} \nu_i^* h_i(x) \\
&\overset{(b)}{=} g(\lambda^*, \nu^*) = d^*
\end{aligned} \tag{2.131}
$$

where (a) follows from the fact that $f_i(x^*) \le 0$, $\lambda_i^* \ge 0$ and
$h_i(x^*) = 0$ for a feasible point $(x^*, \lambda^*, \nu^*)$; and (b) is due to
the definition of the dual function. Take a careful look at the steps in (2.131).
We never used anything about the function types of $(f(x), f_i(x), h_i(x))$.
Hence, weak duality holds for any optimization problem.
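To see the theorem at work numerically, here is a minimal sketch on a toy non-convex problem. The objective, the constraint, the grid over $x$ and the grid over $\lambda$ are all assumptions chosen for illustration; none of them come from the text. The dual function is evaluated by brute force over the grid, so the check below is about the discretized problem only.

```python
# Toy non-convex problem (an illustrative assumption, not from the text):
#   minimize f(x) = x^4 - 3x^2 + x   subject to   f1(x) = 1 - x <= 0.

def f(x):
    return x**4 - 3 * x**2 + x

def f1(x):
    return 1.0 - x

# Discretize x on [-2, 2]; this grid plays the role of the space X.
xs = [-2.0 + 4.0 * k / 4000 for k in range(4001)]

def dual(lam):
    """Dual function g(lam): minimum over the grid of f(x) + lam * f1(x)."""
    return min(f(x) + lam * f1(x) for x in xs)

# Primal optimal value: minimize f over the feasible grid points only.
p_star = min(f(x) for x in xs if f1(x) <= 0)

# Dual optimal value: maximize g over a grid of multipliers lam >= 0.
lams = [0.1 * k for k in range(101)]
d_star = max(dual(lam) for lam in lams)

print(f"p* = {p_star:.4f}, d* = {d_star:.4f}, gap = {p_star - d_star:.4f}")
```

Every value of `dual(lam)` with `lam >= 0` stays below `p_star`, exactly as steps (a) and (b) of the proof predict, even though `f` is not convex.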


Why does weak duality matter? You may wonder why we brought up the weak
duality theorem. As claimed earlier, it can serve to approximate the primal
non-convex optimization problem. To figure out what this means, let us first
point out one critical fact about the dual problem: the dual problem is always
convex no matter what the optimization type is, and therefore it is tractable
(i.e., solvable). This allows us to focus on the tractable dual problem. It turns
out that by solving the dual problem, one can obtain an approximate solution of
the original non-convex primal problem with a gap of $p^* - d^*$.
For the rest of this section, we will provide details on the above. First we will
prove that the dual problem is indeed convex. We will then figure out how the gap
$p^* - d^*$ comes up.


Proof of the convexity of the dual problem Let us prove that the dual problem
is always convex no matter what the function types of $(f(x), f_i(x), h_i(x))$
are. Let us start by considering the dual function:

$$
g(\lambda, \nu) = \min_{x \in \mathcal{X}} \; f(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x).
$$

Here one can make two key observations. The first is that for a particular value of $x$,

$$
f(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x)
$$

is affine in $(\lambda, \nu)$, for any types of the functions $(f(x), f_i(x), h_i(x))$.
The second observation is that the pointwise minimum of a family of affine
functions is concave. Why? This is what we checked before. Hence,

$$
g(\lambda, \nu) \text{ is always concave in } (\lambda, \nu),
$$

no matter what the function types of $(f(x), f_i(x), h_i(x))$ are. Therefore, the
dual problem, which maximizes a concave function subject to the affine constraint
$\lambda \ge 0$, is a convex optimization problem.
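The concavity claim can be sanity-checked numerically. The sketch below uses deliberately non-convex toy functions (all of them assumptions made for this example) and tests the midpoint inequality $g\big(\tfrac{a+b}{2}\big) \ge \tfrac{1}{2}\big(g(a) + g(b)\big)$ on random pairs of multipliers. The minimum over $x$ is taken on a grid, so $g$ here is a minimum of finitely many affine functions of $(\lambda, \nu)$.

```python
import math
import random

def lagrangian(x, lam, nu):
    # Hypothetical non-convex ingredients chosen for illustration.
    f = math.sin(3 * x) + 0.1 * x * x   # objective
    f1 = x * x - 4.0                    # inequality constraint function
    h1 = math.cos(x) - 0.5              # equality constraint function
    return f + lam * f1 + nu * h1

xs = [-3.0 + 6.0 * k / 600 for k in range(601)]

def g(lam, nu):
    """Dual function on the grid: a pointwise minimum of affine functions."""
    return min(lagrangian(x, lam, nu) for x in xs)

random.seed(0)
worst = float("inf")
for _ in range(500):
    a = (random.uniform(0, 5), random.uniform(-5, 5))
    b = (random.uniform(0, 5), random.uniform(-5, 5))
    mid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
    # Concavity requires g(mid) - (g(a) + g(b)) / 2 >= 0.
    worst = min(worst, g(*mid) - 0.5 * (g(*a) + g(*b)))

print("smallest midpoint margin:", worst)
```

The margin never drops below zero beyond floating-point rounding, although none of the three functions of $x$ is convex or affine.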


How to solve the dual problem? As mentioned earlier, weak duality together
with the convexity of the dual problem motivates us to focus on solving the dual
problem to obtain an approximate solution. So let us first discuss how to solve the
dual problem:

$$
\text{(Dual)} \quad d^* := \max_{\lambda, \nu} \; g(\lambda, \nu) : \quad \lambda \ge 0. \tag{2.132}
$$

Notice that the dual problem contains an inequality constraint. So one cannot rely
simply upon gradient descent. We should instead employ a more sophisticated
algorithm. One such algorithm that we studied in Section 2.2 is the interior
point method. Remember that it consists of the following two steps:


1. We first approximate the problem by an unconstrained problem via the
logarithmic barrier function:

$$
LB(z) := -\mu \log(-z), \quad \text{for some } \mu > 0. \tag{2.133}
$$

But we should employ a slight variant of $LB(z)$ in the approximation. Since the
dual problem of interest is about maximization (instead of minimization), we
should employ the negated version of $LB(z)$ so that it takes $-\infty$ (instead
of $+\infty$) when $z > 0$:

$$
-LB(z) = +\mu \log(-z), \quad \text{for some } \mu > 0.
$$

Applying this to the dual problem (2.132), we obtain an approximated
unconstrained optimization:

$$
\text{(Approximated optimization)} \quad \max_{\lambda, \nu} \; g(\lambda, \nu) + \mu \sum_{i=1}^{m} \log \lambda_i. \tag{2.134}
$$

Notice that $\log \lambda_i$ comes from $\log(-(-\lambda_i))$ in the standard form
$-\lambda_i \le 0$ of the inequality constraint.

Figure 2.11. Shape of the logarithmic barrier function for different $\mu$'s.


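2. The next thing to do is to apply alternating gradient descent to the
approximated optimization (2.134). The Lagrange function of the approximated
optimization reads:

$$
L_{\text{app}}(\lambda, \nu) = g(\lambda, \nu) + \mu \sum_{i=1}^{m} \log \lambda_i. \tag{2.135}
$$

Alternating gradient descent then allows us to find a stationary point, say
$(\bar{\lambda}, \bar{\nu})$:

$$
\nabla_{\lambda} L_{\text{app}}(\bar{\lambda}, \bar{\nu}) = 0; \quad \nabla_{\nu} L_{\text{app}}(\bar{\lambda}, \bar{\nu}) = 0. \tag{2.136}
$$
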


Performance of the interior point method Remember that the performance
of the interior point method depends on the control parameter $\mu$ that appears
in the logarithmic barrier (2.133). To see how, consider the stationary point
$(\bar{\lambda}, \bar{\nu})$. The dual function evaluated at this point cannot
exceed $d^*$, as $d^*$ is the maximum value of the dual problem. Hence, we get:

$$
d^* \ge g(\bar{\lambda}, \bar{\nu}).
$$

Also remember what we derived w.r.t. the gap in Section 2.2. Therein, we
considered a minimization problem and showed that the gap to the optimal value
is upper-bounded by $m\mu$. We can apply the same trick to obtain the same bound
for the maximization problem of this section. In other words, we can derive:

$$
d^* - g(\bar{\lambda}, \bar{\nu}) \le m\mu. \tag{2.137}
$$

If you are not crystal clear about this, you may want to derive it by consulting
Prob 5.3. From (2.137), we can now say that for a sufficiently small value of $\mu$,

$$
g(\bar{\lambda}, \bar{\nu}) \approx d^*.
$$
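The bound (2.137) can be observed on a one-dimensional toy instance. In the sketch below, the concave function `g` stands in for a dual function with a single multiplier ($m = 1$) and no $\nu$; it is a hypothetical choice made for illustration. The stationary point of the barrier objective is located by bisection on its derivative, which is swapped in here for the alternating gradient method purely to keep the example short.

```python
def g(lam):
    # Hypothetical concave "dual function"; over lam >= 0 it is maximized
    # at the boundary lam = 0, where it takes the value d* = -1.
    return -(lam + 1.0) ** 2

def barrier_grad(lam, mu):
    # Derivative of the barrier objective g(lam) + mu * log(lam).
    return -2.0 * (lam + 1.0) + mu / lam

def solve_barrier(mu, lo=1e-12, hi=10.0, iters=200):
    """Bisection on the (strictly decreasing) derivative of the barrier objective."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if barrier_grad(mid, mu) > 0:
            lo = mid   # still increasing: the stationary point is to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

m = 1              # number of inequality constraints on the multipliers
d_star = g(0.0)

# Gap d* - g(lam_bar) for a decreasing sequence of control parameters mu.
gaps = {mu: d_star - g(solve_barrier(mu)) for mu in (1.0, 0.1, 0.01, 0.001)}
for mu, gap in gaps.items():
    print(f"mu = {mu:<6} gap = {gap:.6f}  bound m*mu = {m * mu}")
```

As $\mu$ shrinks, the stationary point approaches the true maximizer and the gap stays within $m\mu$.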


Performance gap to optimality $p^*$ So under a choice of sufficiently small $\mu$,
the gap to the optimal value $p^*$ in the primal problem would be:

$$
\text{Gap} \approx p^* - d^*.
$$

There is a term for this gap. Since the gap is w.r.t. the dual problem, it is
called the duality gap.

Let us introduce another name, which refers to the approximation technique
based on weak duality. We call the technique Lagrange relaxation. Why? Notice that
the dual problem yields a smaller optimal value ($d^* \le p^*$), and the primal
problem is about minimization. Hence, we can interpret the dual problem as one
with more relaxed constraints. Also it is about duality, so it employs the
Lagrange function in the process. That is why we call it Lagrange relaxation.

Whenever we study a relaxation technique, we need to worry about how good
the relaxation technique is. So one can ask: how far is the bound due to Lagrange
relaxation from the optimal value $p^*$?


How good is Lagrange relaxation? Here we intend to address the question 
by comparing to other relaxation techniques that we investigated earlier. The first 
relaxation technique is the one that we studied in Section 1.5 in the context of LP. 
That was, LP relaxation. The second is the one that we studied in Section 1.14 
for a different problem context. That was, SDP relaxation. So we are interested 
particularly in how good Lagrange relaxation is relative to LP and SDP relaxation 
techniques. 
Lagrange relaxation vs. LP relaxation First let us compare to LP relaxation.
Such relaxation arose in a very difficult problem class that we studied before:
the Boolean problem. So let us make a comparison under the Boolean problem.
It turns out that for such a problem:


Lagrange relaxation has the same performance as that of LP relaxation. 


Lagrange relaxation vs. SDP relaxation What about for SDP relaxation? It 
turns out that in general, 


Lagrange relaxation is at least as good as SDP relaxation. 


Another interesting result comes in for a variety of important and classical problem 
instances including the MAXCUT problem that we studied in Section 1.14. It turns 
out that for such problem instances: 


Lagrange relaxation and SDP relaxation yield the same performance. 


Look ahead In the upcoming sections, we will study how to implement
Lagrange relaxation for the two problems of interest: the Boolean and MAXCUT problems.
2.6 Lagrange Relaxation for Boolean Problems


Recap In the previous section, we switched gears to move onto non-convex opti- 
mization. We emphasized that the techniques that we have learned so far play a 
crucial role in approximating non-convex optimization. To understand how the 
approximation is implemented, we needed to study another important theory
regarding duality. That is, the weak duality theorem:

$$
p^* \ge d^* \text{ holds for any optimization problem.} \tag{2.138}
$$


Here p* and d* indicate the optimal values of the primal and dual problems, respec- 
tively. One important thing that we proved next is that the dual problem is always 
convex no matter what the types of the functions that appear in the optimization 
problem are. This motivates us to focus on solving the solvable dual problem, and 
one can solve it using the interior point method. Assuming that we use a
sufficiently small value of the control parameter $\mu$ that appears in the logarithmic barrier
employed for the interior point method, we can achieve d* approximately, which 
in turn ensures the performance gap to the optimality to be roughly p* — d*. This 
gap is called the duality gap, and the approximation technique that leads to the gap 
is called Lagrange relaxation. At the end, we mentioned that Lagrange relaxation 
plays a significant role especially for two interested non-convex problems that we 
investigated earlier: Boolean and MAXCUT problems. 


Outline In this section, we are going to investigate how to implement Lagrange
relaxation for one such problem: the Boolean problem. What we are going to do
is fourfold. First of all, we will review the Boolean problem. Next we will
express the Boolean problem in the standard form. We will then employ Lagrange
relaxation to derive the dual problem. Finally we will show that the translated
dual problem belongs to one convex instance that we studied in Part I: SDP.


Review of Boolean problems Remember that we discussed the Boolean problems
in the context of LP. So the problems are basically built on top of LP, with
one special constraint added: the optimization variable $x$ takes binary values.
So the problem takes the following form:

$$
\begin{aligned}
p^* := \min_{x} \; & w^T x : \\
& Ax - b \le 0, \quad Cx - e = 0, \\
& x_i \in \{0, 1\}, \quad i \in \{1, \dots, d\}.
\end{aligned} \tag{2.139}
$$

Here the last additional constraint says that the optimization variable is
constrained to be Boolean.
Standard form Before studying how to implement Lagrange relaxation, let us
convert the above into the standard form that contains only canonical types of
constraints: inequality and/or equality constraints. Notice that the Boolean
constraint $x_i \in \{0, 1\}$ above is neither of inequality nor of equality type.
But the translation to an equality constraint is easy. The Boolean constraint can
be equivalently expressed as follows:

$$
x_i (x_i - 1) = 0, \quad i \in \{1, \dots, d\}. \tag{2.140}
$$

We now have two types of equality constraints: one is the affine constraint
$Cx - e = 0$; and the other is the above (2.140). In an effort to exhibit only one
type of equality constraint, let us convert the affine constraint $Cx - e = 0$
into inequality constraints. This way, we can merge them with $Ax - b \le 0$,
making the notation simpler. We already know how to convert the affine
constraint: represent it with two inequality constraints, $Cx - e \le 0$ and
$Cx - e \ge 0$. With this conversion, we can simplify (2.139) as:

$$
\begin{aligned}
p^* = \min_{x} \; & w^T x : \\
& Ax - b \le 0, \\
& x_i (x_i - 1) = 0, \quad i \in \{1, \dots, d\}
\end{aligned} \tag{2.141}
$$

where $(A, b)$ are not the same as those in the original optimization (2.139),
but are the merged versions that incorporate $Cx - e \le 0$ and $Cx - e \ge 0$.


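Lagrange function Let us think about how to implement Lagrange relaxation.
To this end, first consider the Lagrange function:

$$
\begin{aligned}
L(x, \lambda, \nu) &= w^T x + \lambda^T (Ax - b) + \sum_{i=1}^{d} \nu_i x_i (x_i - 1) \\
&= (w + A^T \lambda - \nu)^T x - \lambda^T b + \sum_{i=1}^{d} \nu_i x_i^2 \\
&= (w + A^T \lambda - \nu)^T x - \lambda^T b + x^T \mathrm{diag}(\nu_1, \dots, \nu_d) x \\
&= (w + A^T \lambda - \nu)^T x - \lambda^T b + x^T D_\nu x
\end{aligned} \tag{2.142}
$$

where the last equality follows from the definition $D_\nu := \mathrm{diag}(\nu_1, \dots, \nu_d)$.
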
Dual function Next consider the dual function:

$$
g(\lambda, \nu) = \min_{x \in \mathcal{X}} \; x^T D_\nu x + (w + A^T \lambda - \nu)^T x - \lambda^T b \tag{2.143}
$$

where $\mathcal{X}$ denotes the entire space ($\mathbb{R}^d$ in this example).
Here $g(\lambda, \nu)$ looks like a QP, as it contains a quadratic term
$x^T D_\nu x$ and $D_\nu$ is symmetric. But it is not clear whether the associated
optimization is indeed a QP. Why? The reason is that $D_\nu$ is not necessarily
positive semi-definite (PSD). Notice that $D_\nu$ is what we can optimize over in
view of the dual problem, which takes $\nu$ (together with $\lambda$) as an
optimization variable.

Depending on the choice of $D_\nu$, we can think of two cases. The first is an
easy case where the problem is indeed a QP and therefore the solution can be
readily derived. In the easy case, the symmetric matrix $D_\nu$ satisfies:

$$
\text{Case I:} \quad D_\nu \succeq 0. \tag{2.144}
$$

In this case, $D_\nu$ is indeed PSD, so the optimization problem in (2.143) becomes
an unconstrained QP. So one can solve it by finding the stationary point $x^*$ that
satisfies:

$$
\nabla_x L(x^*, \lambda, \nu) = 2 D_\nu x^* + w + A^T \lambda - \nu = 0. \tag{2.145}
$$

But there is a caveat here: there may be no solution $x^*$ that satisfies (2.145).
To see this clearly, let us consider two subcases depending on whether the
solution exists.

The first is the no-solution case:

$$
\text{Case I-1:} \quad D_\nu \succeq 0, \quad w + A^T \lambda - \nu \notin \mathrm{range}(D_\nu). \tag{2.146}
$$

In this case, there is no stationary point $x^*$. This implies that the Lagrange
function $L(x, \lambda, \nu)$ can be made arbitrarily small, all the way down to
$-\infty$. In the literature, one says that $L(x, \lambda, \nu)$ is unbounded
below. So in this case, the dual function $g(\lambda, \nu)$ is $-\infty$. This is
definitely not a case of interest. Why? The dual problem (that we will formulate
soon) is about maximization, so $g(\lambda, \nu) = -\infty$ is definitely not what
we want to achieve. The second is the case in which there is indeed a solution:

$$
\text{Case I-2:} \quad D_\nu \succeq 0, \quad w + A^T \lambda - \nu \in \mathrm{range}(D_\nu). \tag{2.147}
$$


This is definitely the case of interest. Here we can solve for $x^*$ by resorting
to the generalized definition of $D_\nu^{-1}$:

$$
[D_\nu^{-1}]_{ii} = \begin{cases} \dfrac{1}{\nu_i}, & \text{if } \nu_i \neq 0; \\ 0, & \text{if } \nu_i = 0. \end{cases} \tag{2.148}
$$

This generalized definition is needed because $\nu_i = 0$ makes $\nu_i^{-1}$ blow
up. In the case of $\nu_i = 0$, the range condition in (2.147) forces the $i$-th
entry of $w + A^T \lambda - \nu$ to be zero, so the corresponding $x_i$ can be
anything; in particular $x_i = 0$ is a valid choice. Hence, one can set the
minimizer $x^*$ simply as:

$$
x^* = -\frac{1}{2} D_\nu^{-1} (w + A^T \lambda - \nu). \tag{2.149}
$$


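Plugging this into the objective function in (2.143), we can obtain the dual
function as:

$$
\begin{aligned}
g(\lambda, \nu) &= L(x^*, \lambda, \nu) \\
&= \frac{1}{4} (w + A^T \lambda - \nu)^T (D_\nu^{-1})^T D_\nu D_\nu^{-1} (w + A^T \lambda - \nu) \\
&\quad - \frac{1}{2} (w + A^T \lambda - \nu)^T D_\nu^{-1} (w + A^T \lambda - \nu) - \lambda^T b \\
&= -\frac{1}{4} (w + A^T \lambda - \nu)^T D_\nu^{-1} (w + A^T \lambda - \nu) - \lambda^T b.
\end{aligned} \tag{2.150}
$$
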
On the other hand, one may want to consider the other case:

$$
\text{Case II:} \quad D_\nu \nsucceq 0. \tag{2.151}
$$

$D_\nu \nsucceq 0$ means that $\nu_i < 0$ for some $i$. This is not a case of
interest either. Why? Recall the expression (2.142) of the Lagrange function:

$$
L(x, \lambda, \nu) = (w + A^T \lambda - \nu)^T x - \lambda^T b + x^T D_\nu x. \tag{2.152}
$$

By letting $x_i \to \infty$ for such $i$ (with the other entries fixed), we get
$L(x, \lambda, \nu) \to -\infty$. This then leads to $g(\lambda, \nu) = -\infty$.
Again this is not what we want to achieve because the dual problem is about
maximization.
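The closed form (2.150) is easy to evaluate when every $\nu_i > 0$, because $D_\nu$ is then invertible and the range condition holds automatically. The sketch below does this for a tiny Boolean instance and compares the result with the brute-force primal optimum. The data `w`, `A`, `b` and the multipliers tried are hypothetical values picked for illustration.

```python
from itertools import product

# Hypothetical Boolean instance: minimize w^T x subject to x1 + x2 + x3 <= 2.
w = [-1.0, 2.0, -3.0]
A = [[1.0, 1.0, 1.0]]
b = [2.0]
d, m = len(w), len(b)

def y_vec(lam, nu):
    """The vector w + A^T lam - nu that appears throughout (2.142)-(2.150)."""
    return [w[i] + sum(A[k][i] * lam[k] for k in range(m)) - nu[i]
            for i in range(d)]

def lagrangian(x, lam, nu):
    """L(x, lam, nu) as in the last line of (2.142)."""
    y = y_vec(lam, nu)
    return (sum(y[i] * x[i] + nu[i] * x[i] ** 2 for i in range(d))
            - sum(lam[k] * b[k] for k in range(m)))

def dual(lam, nu):
    """Closed form (2.150), valid here because every nu_i > 0."""
    y = y_vec(lam, nu)
    return (-0.25 * sum(y[i] ** 2 / nu[i] for i in range(d))
            - sum(lam[k] * b[k] for k in range(m)))

# Brute-force primal optimum over the feasible Boolean vectors.
feasible = [x for x in product((0, 1), repeat=d)
            if all(sum(A[k][i] * x[i] for i in range(d)) <= b[k]
                   for k in range(m))]
p_star = min(sum(w[i] * x[i] for i in range(d)) for x in feasible)

# One particular choice of multipliers and the minimizer from (2.149).
lam, nu = [0.0], [1.0, 2.0, 3.0]
x_min = [-0.5 * yi / ni for yi, ni in zip(y_vec(lam, nu), nu)]

print("p* =", p_star, " g(lam, nu) =", dual(lam, nu), " x* =", x_min)
```

For this particular instance the chosen multipliers already attain $g(\lambda, \nu) = p^*$, so the duality gap happens to be zero; other multipliers give strictly smaller values, in line with weak duality.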


Dual problem In summary, what we are interested in is only Case I-2 (2.147)
wherein $g(\lambda, \nu)$ is finite. Focusing on this case, we obtain the dual
problem as:

$$
\begin{aligned}
\max_{\lambda \ge 0, \, \nu \ge 0} \; & -\frac{1}{4} (w + A^T \lambda - \nu)^T D_\nu^{-1} (w + A^T \lambda - \nu) - \lambda^T b : \\
& w + A^T \lambda - \nu \in \mathrm{range}(D_\nu)
\end{aligned} \tag{2.153}
$$

where the constraint $\nu \ge 0$ comes from $D_\nu \succeq 0$ (2.144) and the dual
function $g(\lambda, \nu)$ is due to (2.150). As mentioned earlier, the definition
of $D_\nu^{-1}$ is subject to (2.148).

Notice in the above that the first term in the objective function (the quadratic
form involving $D_\nu^{-1}$) is not compatible with the standard form of any
convex optimization problem that we studied earlier. But interestingly it turns
out that we can convert it into a form that we are familiar with: SDP. So for the
rest of this section, we will translate (2.153) into an SDP.
We first introduce a new optimization variable, say $t$, such that

$$
t \le -\frac{1}{4} (w + A^T \lambda - \nu)^T D_\nu^{-1} (w + A^T \lambda - \nu).
$$

A key observation here is that by maximizing $t$, one should make the RHS larger,
and also by maximizing the RHS, one can set $t$ larger. So maximizing $t$ is
equivalent to maximizing the RHS. Hence, with this new variable $t$, one can
write (2.153) as:

$$
\begin{aligned}
\max_{\lambda \ge 0, \, \nu \ge 0} g(\lambda, \nu) = \max_{\lambda \ge 0, \, \nu \ge 0, \, t} \; & t - \lambda^T b : \\
& w + A^T \lambda - \nu \in \mathrm{range}(D_\nu); \\
& t \le -\frac{1}{4} (w + A^T \lambda - \nu)^T D_\nu^{-1} (w + A^T \lambda - \nu).
\end{aligned} \tag{2.154}
$$

The newly-introduced inequality constraint here can be alternatively written as:

$$
-t - (w + A^T \lambda - \nu)^T (4 D_\nu)^{-1} (w + A^T \lambda - \nu) \ge 0. \tag{2.155}
$$


This should look familiar. Why? It is reminiscent of the Schur complement lemma;
more precisely, the generalized Schur complement lemma, since the matrix of
interest $D_\nu$ is PSD, not necessarily PD.

Generalized Schur complement lemma: Suppose $A \succeq 0$. Then,

$$
X = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \succeq 0 \iff S := C - B^T A^{\dagger} B \succeq 0; \quad Bu \in \mathrm{range}(A) \;\; \forall u \tag{2.156}
$$

where $A^{\dagger} := U \Sigma^{-1} U^T$ when $A = U \Sigma U^T$. Here $\Sigma^{-1}$
is subject to the definition of (2.148).


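As per the above generalized lemma, as long as we define $D_\nu^{-1}$ as the one
in (2.148), the two constraints in (2.154) are equivalent to:

$$
\begin{bmatrix} 4 D_\nu & w + A^T \lambda - \nu \\ (w + A^T \lambda - \nu)^T & -t \end{bmatrix} \succeq 0. \tag{2.157}
$$

Taking this conversion, we can write (2.154) as:

$$
\begin{aligned}
d^* := \max_{\lambda, \nu, t} \; & t - \lambda^T b : \\
& \begin{bmatrix} 4 D_\nu & w + A^T \lambda - \nu \\ (w + A^T \lambda - \nu)^T & -t \end{bmatrix} \succeq 0, \\
& \lambda \ge 0, \; \nu \ge 0.
\end{aligned} \tag{2.158}
$$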
Notice that all the inequalities are ones that we are familiar with. These are
all LMIs. So this problem is an SDP.
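To make the use of the generalized Schur complement lemma explicit, the blocks can be matched as follows. This is only a restatement of the step from (2.154) to the block-matrix constraint; the shorthand $y$ and the subscript "lemma" are introduced here for readability and do not appear in the text. The lemma applies because $\nu \ge 0$ gives $4 D_\nu \succeq 0$.

```latex
% Matching the two constraints of (2.154) to the lemma (2.156).
\begin{aligned}
y &:= w + A^T \lambda - \nu, \qquad
A_{\text{lemma}} = 4 D_\nu, \quad B = y, \quad C = -t, \\
S &= C - B^T A_{\text{lemma}}^{\dagger} B
   = -t - y^T (4 D_\nu)^{-1} y \ge 0
   \iff t \le -\tfrac{1}{4}\, y^T D_\nu^{-1} y, \\
Bu &\in \mathrm{range}(4 D_\nu) \;\; \forall u \in \mathbb{R}
   \iff y \in \mathrm{range}(D_\nu).
\end{aligned}
```

Since $B$ is a single column here, $u$ is a scalar, and the range condition of the lemma reduces to the range constraint that already appears in (2.154).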


Look ahead In the next section, we will study how to implement Lagrange relax- 
ation for another interested problem: the MAXCUT problem. 
2.7 Lagrange Relaxation for the MAXCUT Problem


Recap In the previous section, we studied how to implement Lagrange relaxation 
in the context of the Boolean problem. We derived the dual function in closed 
form. We then employed the generalized Schur complement lemma to translate 
the dual problem into a tractable SDP. In an effort to say a few words about the 
performance of Lagrange relaxation, we also discussed in brief the comparison to LP 
relaxation. 


Outline In this section, we will do the same thing yet w.r.t. a different
problem: the MAXCUT problem, explored in Section 1.14. Specifically we are going
to cover the following five things. First off, we will review what the MAXCUT
problem is. We will then express the MAXCUT problem in the standard form. Next we
will derive the dual function in closed form, by employing the technique that we
learned in the prior section. In the fourth part, we will show that the dual
problem can be formulated as an SDP. Lastly we will discuss a comparison to the
SDP relaxation that we employed for the purpose of approximating the MAXCUT
problem in Section 1.14.


Review of the MAXCUT problem Let us start by reviewing the MAXCUT prob- 
lem. The goal of the problem is to find a set that maximizes a cut. See Fig. 2.12 to 
refresh your memory. 

To formulate an optimization problem, we introduced an optimization variable
$x_i$ that indicates whether node $i$ is in a candidate set $S$:

$$
x_i = \begin{cases} +1, & i \in S; \\ -1, & \text{otherwise.} \end{cases} \tag{2.159}
$$

Figure 2.12. The MAXCUT problem in which the goal is to find a set that maximizes a cut.
In this example, the set $S = \{1, 3, 5\}$ (so that $S^c = \{2, 4, 6\}$) and the cut w.r.t. the set $S$ is
$w_{54} + w_{14} + w_{36} + w_{12} + w_{32}$. Here $w_{ij}$ denotes a weight associated with an edge $(i, j) \in E$.
We then made a key observation. When $x_i \neq x_j$, the edge $(i, j)$ crosses the
two sets $S$ and $S^c$, and hence contributes to the cut by the amount of $w_{ij}$.
On the other hand, when $x_i = x_j$, there is no contribution to the cut. This
yielded the following optimization:

$$
\begin{aligned}
\max_{x} \; & \sum_{i=1}^{d} \sum_{j=1}^{d} \frac{1}{2} w_{ij} (1 - x_i x_j) : \\
& x_i^2 = 1, \quad i \in \{1, \dots, d\}
\end{aligned} \tag{2.160}
$$

where $d$ denotes the number of nodes in the graph and the constraint $x_i^2 = 1$
reflects the fact that $x_i$ can only be either $+1$ or $-1$. Here we allow for
double counting, as it does not alter the optimal solution.


Standard form For simplification, let us multiply the objective function by 2.
Since the standard form is about minimization, we also flip the sign in the objective
to convert the problem into:

$$
\begin{aligned}
\text{(Primal):} \quad p^* := \min_{x} \; & \sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} (x_i x_j - 1) : \\
& x_i^2 = 1, \quad i \in \{1, \dots, d\}.
\end{aligned} \tag{2.161}
$$

Say that this is the primal problem that we start with.


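Lagrange function How do we implement Lagrange relaxation for the primal
problem (2.161)? To this end, first consider the Lagrange function:

$$
\begin{aligned}
L(x, \nu) &= \sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} (x_i x_j - 1) + \sum_{i=1}^{d} \nu_i (1 - x_i^2) \\
&= -\sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} + \sum_{i=1}^{d} \nu_i + \sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} x_i x_j - \sum_{i=1}^{d} \nu_i x_i^2
\end{aligned} \tag{2.162}
$$

where $\nu \in \mathbb{R}^d$ denotes the Lagrange multiplier associated with the
equality constraints.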

Notice that the Lagrange function (2.162) contains some complicated-looking
summation terms. One way to represent these unwieldy terms succinctly is to rely
upon matrix and vector notations. To this end, consider a matrix, say $W$,
which is defined as:

$$
W = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1d} \\ w_{21} & w_{22} & \cdots & w_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ w_{d1} & w_{d2} & \cdots & w_{dd} \end{bmatrix} \in \mathbb{R}^{d \times d}. \tag{2.163}
$$

Note that the matrix $W$ is symmetric, i.e., $W = W^T$, as we consider an undirected
graph for the MAXCUT problem. We also set $w_{ij} = 0$ when $(i, j) \notin E$. So the
diagonal entries must be zero: $w_{ii} = 0$.

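With this $W$ notation, we can represent the summation terms in (2.162) as:

$$
\sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} = [1, 1, \dots, 1] \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1d} \\ w_{21} & w_{22} & \cdots & w_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ w_{d1} & w_{d2} & \cdots & w_{dd} \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} = \mathbf{1}^T W \mathbf{1};
$$

$$
\sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} x_i x_j = \sum_{i=1}^{d} x_i \sum_{j=1}^{d} w_{ij} x_j = [x_1, x_2, \dots, x_d] \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1d} \\ w_{21} & w_{22} & \cdots & w_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ w_{d1} & w_{d2} & \cdots & w_{dd} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} = x^T W x.
$$

Applying the above to (2.162), we then get:

$$
L(x, \nu) = -\mathbf{1}^T W \mathbf{1} + \nu^T \mathbf{1} + x^T W x - x^T D_\nu x \tag{2.164}
$$

where $D_\nu := \mathrm{diag}(\nu_1, \nu_2, \dots, \nu_d)$.
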
Dual function Next consider the dual function:

$$
g(\nu) = \min_{x \in \mathcal{X}} \; x^T (W - D_\nu) x - \mathbf{1}^T W \mathbf{1} + \nu^T \mathbf{1} \tag{2.165}
$$

where $\mathcal{X}$ indicates the entire space. Here $g(\nu)$ looks like a QP, as it
contains a quadratic term $x^T (W - D_\nu) x$ and $W - D_\nu$ is symmetric. But as we
encountered in Section 2.6, it is not clear whether the associated optimization
problem is indeed a QP. The reason is that $W - D_\nu$ is not necessarily positive
semi-definite (PSD). So let us consider two cases depending on whether
$W - D_\nu$ is PSD.
The first case is:

$$
\text{Case I:} \quad W - D_\nu \succeq 0 \;\; \text{(PSD)}. \tag{2.166}
$$

In this case, the optimization problem in (2.165) is an unconstrained QP. So one
can solve it by finding $x^*$ such that

$$
\nabla_x L(x^*, \nu) = 2 (W - D_\nu) x^* = 0 \tag{2.167}
$$

if such $x^*$ exists. Notice that there always exists such $x^*$ satisfying the
optimality condition (for instance, $x^* = 0$). Hence, by putting $x^*$ into
$L(x, \nu)$, we obtain the optimal value as:

$$
\begin{aligned}
g(\nu) = L(x^*, \nu) &= x^{*T} (W - D_\nu) x^* - \mathbf{1}^T W \mathbf{1} + \nu^T \mathbf{1} \\
&= -\mathbf{1}^T W \mathbf{1} + \nu^T \mathbf{1}
\end{aligned} \tag{2.168}
$$

where the last equality comes from (2.167).

On the other hand, the second case is:

$$
\text{Case II:} \quad W - D_\nu \nsucceq 0 \;\; \text{(not PSD)}. \tag{2.169}
$$

This non-PSD condition implies that there exists $x \in \mathbb{R}^d$ such that

$$
x^T (W - D_\nu) x < 0.
$$

Here by scaling up such $x$ infinitely, we get $L(x, \nu) \to -\infty$, which in
turn leads to $g(\nu) = -\infty$. Obviously this is not what we want to achieve,
so we ignore the second case.


Dual problem Focusing on the first case (2.166) in which $g(\nu)$ is finite and we
have the PSD constraint $W - D_\nu \succeq 0$, we can obtain the dual problem as:

$$
\begin{aligned}
d^* := \max_{\nu} \; g(\nu) = \max_{\nu} \; & -\mathbf{1}^T W \mathbf{1} + \nu^T \mathbf{1} : \\
& W - D_\nu \succeq 0
\end{aligned} \tag{2.170}
$$

where $g(\nu)$ is due to (2.168). Notice that this problem is one that we are
familiar with. That is, an SDP.
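A small sanity check of the dual bound can be run without an SDP solver. The sketch below brute-forces the primal problem (2.161) on a hypothetical 4-node graph and evaluates $g(\nu)$ at one hand-picked dual-feasible point: $\nu_i = -\sum_j |w_{ij}|$, which makes $W - D_\nu$ diagonally dominant and hence PSD. This choice of $\nu$ is an assumption made for simplicity and is generally far from the optimal $\nu$ that an SDP solver would return, so the bound it gives is loose.

```python
import random
from itertools import product

# Hypothetical undirected graph: symmetric weights, zero diagonal.
W = [[0, 1, 2, 0],
     [1, 0, 1, 3],
     [2, 1, 0, 1],
     [0, 3, 1, 0]]
d = len(W)
total = sum(sum(row) for row in W)          # equals 1^T W 1

def primal_objective(x):
    """Objective of (2.161): sum_ij w_ij (x_i x_j - 1)."""
    return sum(W[i][j] * (x[i] * x[j] - 1)
               for i in range(d) for j in range(d))

# Brute-force primal optimum over all sign vectors.
p_star = min(primal_objective(x) for x in product((-1, 1), repeat=d))

# Hand-picked dual-feasible point via diagonal dominance (an assumption).
nu = [-sum(abs(W[i][j]) for j in range(d)) for i in range(d)]
g_nu = -total + sum(nu)                     # closed form (2.168)

def quad_form(x):
    """x^T (W - D_nu) x, which should be nonnegative for a feasible nu."""
    return (sum(W[i][j] * x[i] * x[j] for i in range(d) for j in range(d))
            - sum(nu[i] * x[i] ** 2 for i in range(d)))

random.seed(0)
min_quad = min(quad_form([random.uniform(-1, 1) for _ in range(d)])
               for _ in range(1000))

print("p* =", p_star, " g(nu) =", g_nu, " min sampled quadratic form =", min_quad)
```

Here $p^* = -4 \times (\text{maximum cut})$ because every cut edge is counted twice with a factor of $-2$, and the hand-picked $\nu$ yields a valid lower bound on it.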


Comparison to SDP relaxation Let us say a few words about the performance
of Lagrange relaxation relative to another relaxation technique that we applied to
the MAXCUT problem before (Section 1.14): SDP relaxation. To make a concrete
comparison, let us review what the optimization problem due to SDP relaxation
was. To this end, first recall the primal problem (2.161) that we started with:

$$
\begin{aligned}
p^* = \min_{x} \; & \sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} x_i x_j - \mathbf{1}^T W \mathbf{1} : \\
& x_i^2 = 1, \quad i \in \{1, \dots, d\}.
\end{aligned} \tag{2.171}
$$

Here we represent the problem using matrix-vector notations, in an effort to ease
comparison with the Lagrange-relaxed problem (2.170), represented with such
notations.

In Section 1.14, we employed a technique, called lifting, to enable SDP
relaxation. The idea of lifting is to raise the vector space that the optimization
variable lives in into a matrix space. In order to apply this idea, we introduced
a new matrix $X$ such that its $(i, j)$-entry $[X]_{ij}$ is defined as:

$$
X_{ij} := x_i x_j. \tag{2.172}
$$

A more succinct way to represent this was: $X = x x^T$. This then yielded the
following constraints:

$$
X_{ii} = 1, \quad X \succeq 0, \quad \mathrm{rank}(X) = 1. \tag{2.173}
$$

Dropping the last rank constraint was essentially the key idea of SDP relaxation.
So the SDP-relaxed problem was:

$$
\begin{aligned}
p^*_{\text{SDP}} = \min_{X} \; & \sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} X_{ij} - \mathbf{1}^T W \mathbf{1} : \\
& X_{ii} = 1, \quad i \in \{1, \dots, d\}; \\
& X \succeq 0.
\end{aligned} \tag{2.174}
$$

Looking at (2.170) and (2.174), a comparison is not that straightforward.
The problem (2.170) is about maximization, while the problem (2.174) is about
minimization. In order to ease comparison, we can apply the strong duality theorem
to obtain the dual problem of (2.174), which would then be about maximization.
However, there is an issue in formulating the dual problem of (2.174). The issue
comes from the form of the inequality constraint in (2.174). Why? The inequality
is now w.r.t. a symmetric matrix, not a scalar or a vector. So far we have never
formulated a Lagrange function for an optimization problem which involves such a
matrix inequality, so we do not know what a proper Lagrange multiplier is for
the problem. But it turns out that there is a proper way of defining a Lagrange
function in the matrix-inequality context. Interestingly, by applying this trick,
one can show that the dual problem of (2.174) is exactly the same as the dual
problem (2.170) obtained via Lagrange relaxation. This implies that the two
relaxation techniques yield the same performance. We will not delve into details
here. But you will have a chance to check this in Prob 7.4 (with the help of
Prob 7.5).


Summary of Part I and Part II We have thus far studied many topics for both
convex and non-convex optimization problems. In Part I, we have investigated
many instances of convex optimization problems together with some algorithms 
(like the simplex algorithm and gradient descent) that can be applied to certain 
settings. One critical thing that we missed is about the development of generic algo- 
rithms that can be applied to arbitrary settings under the class. 

To fill up the missing part, in Part II, we studied an important concept about 
duality: strong duality. We studied what it means and then figured out that it pro- 
vides algorithmic insights into convex optimization. It leads to a famous algorithm, 
called the interior point method. With strong duality, we could complete the con- 
vex optimization story. 

We then moved onto non-convex optimization problems. What we could figure 
out is that another important theory regarding duality helps approximating optimal 
solutions of non-convex optimization. That is, the weak duality theorem. We also 
studied an approximation technique based on weak duality (called Lagrange relax- 
ation) that offers good performances for some class of difficult problems including 
the Boolean and MAXCUT problems. All of the above form the contents of Part I 
and Part II. 


Outline of Part III A natural follow-up question arises. What more can we do
with the techniques that we have learned so far? It turns out that we can do
something crucial in a wide variety of research fields. One such field that has
been trending during the past decades is machine learning.

So in Part III, we are going to study how the optimization techniques that we 
learned can play a role in the trending field. In machine learning, there are two rep- 
resentative learning methods: (i) supervised learning in which training data contain 
label information (like the identity of emails among spam vs. legitimate emails); and 
(ii) unsupervised learning in which such label information is not available. Actually 
supervised learning is the one that we already investigated in Part I, specifically in 
the context of spam filter design. In supervised learning, we will explore further, 
particularly the role of convex optimization, in the design of potentially better
classifiers that sometimes yield the optimal performance in a certain sense. In
unsupervised learning, we will study one very popular technique based on
Generative Adversarial Networks (GANs). In this context, we will figure out a
powerful role of strong duality that we learned in Part II. In addition to these
popular methodologies, we will investigate the role of the regularization
technique and KKT conditions in the design of trending machine learning
classifiers, called fair classifiers, which aim to ensure fairness between
disadvantaged and advantaged groups.
Problem Set 7


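Prob 7.1 (The dual problem of LP relaxation) Consider an LP which is
relaxed from a Boolean problem:

$$
\begin{aligned}
\min_{x \in \mathbb{R}^d} \; & w^T x : \\
& Ax - b \le 0, \\
& 0 \le x \le 1
\end{aligned} \tag{2.175}
$$

where $A \in \mathbb{R}^{m \times d}$ and $b \in \mathbb{R}^m$. Show that the dual
problem can be represented as:

$$
\begin{aligned}
\max_{\lambda, \nu, t} \; & t - \lambda^T b : \\
& w + A^T \lambda + \nu \ge 0, \\
& -t - \nu^T \mathbf{1} \ge 0, \\
& \lambda \ge 0, \; \nu \ge 0
\end{aligned} \tag{2.176}
$$

where $\lambda \in \mathbb{R}^m$, $\nu \in \mathbb{R}^d$ and $t \in \mathbb{R}$.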


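Prob 7.2 (Lagrange relaxation vs. LP relaxation) Consider the Lagrange-relaxed
problem of a Boolean problem that we formulated in Section 2.6:

$$
\begin{aligned}
d^* := \max_{\lambda, \nu, t} \; & t - \lambda^T b : \\
& \begin{bmatrix} 4 D_\nu & w + A^T \lambda - \nu \\ (w + A^T \lambda - \nu)^T & -t \end{bmatrix} \succeq 0, \\
& \lambda \ge 0, \; \nu \ge 0
\end{aligned} \tag{2.177}
$$

where $D_\nu := \mathrm{diag}(\nu_1, \dots, \nu_d)$, $\lambda \in \mathbb{R}^m$,
$A \in \mathbb{R}^{m \times d}$, $b \in \mathbb{R}^m$ and $t \in \mathbb{R}$. Also
consider the dual problem of the LP-relaxed problem that we claimed in Section 2.6
and you proved in Prob 7.1:

$$
\begin{aligned}
d^*_{\text{LP}} := \max_{\lambda, \nu, t} \; & t - \lambda^T b : \\
& w + A^T \lambda + \nu \ge 0, \\
& -t - \nu^T \mathbf{1} \ge 0, \\
& \lambda \ge 0, \; \nu \ge 0.
\end{aligned} \tag{2.178}
$$

Show that $d^*_{\text{LP}} \le d^*$.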
Prob 7.3 (Lagrange relaxation) Consider a Boolean problem:

$$
\begin{aligned}
p^* := \min_{x} \; & w^T x : \\
& Ax - b \le 0, \\
& x_i (x_i - 1) = 0, \quad i \in \{1, \dots, d\}
\end{aligned} \tag{2.179}
$$

where $x, w \in \mathbb{R}^d$; $A \in \mathbb{R}^{m \times d}$; and
$b \in \mathbb{R}^m$. Also consider an LP relaxation of the problem (2.179):

$$
\begin{aligned}
p^*_{\text{LP}} = \min_{x} \; & w^T x : \\
& Ax - b \le 0, \\
& 0 \le x \le 1.
\end{aligned} \tag{2.180}
$$

(a) Derive a Lagrange-relaxed optimization problem of (2.179) as an SDP. Also
explain why it is an SDP.

(b) Derive the dual problem of (2.180) as an SDP. Also explain why it is an
SDP.

(c) Let $d^*$ be the optimal value of the Lagrange-relaxed optimization problem
derived in part (a). Let $d^*_{\text{LP}}$ be the optimal value of the dual
problem derived in part (b). A student claims that there exists a case in which
$d^*_{\text{LP}} < d^*$. Prove or disprove this claim.


Prob 7.4 (Lagrange relaxation vs. SDP relaxation) Consider the optimization
problem of the MAXCUT problem that we studied in Section 2.7:

$$p^* := \min_x \sum_{i=1}^d \sum_{j=1}^d w_{ij} x_i x_j : \quad x_i^2 = 1 \tag{2.181}$$

where $w_{ij} \in \mathbb{R}$ denotes a weight associated with an edge $(i, j)$ in an undirected
graph G.

(a) Using Lagrange relaxation, derive the dual problem of (2.181) as an SDP.
Also explain why it is an SDP.

(b) Using SDP relaxation together with the lifting technique, show that the
optimization (2.181) can be approximated as:

$$p^*_{\mathrm{SDP}} := \min_X \sum_{i=1}^d \sum_{j=1}^d w_{ij} X_{ij} : \quad X \succeq 0, \quad X_{ii} = 1. \tag{2.182}$$

(c) Let $d^*$ be the optimal value of the dual problem derived in part (a). Let
$d^*_{\mathrm{SDP}}$ be the optimal value of the dual problem w.r.t. the SDP-relaxed
optimization (2.182). A student claims that $d^*_{\mathrm{SDP}} = d^*$. Prove or disprove this
claim.


Prob 7.5 (Weak duality for SDP) Consider an optimization problem:

$$p^* := \max_{x \in \mathbb{R}^d} f(x) : \quad G + x_1 F_1 + \cdots + x_d F_d \succeq 0 \tag{2.183}$$

where $G \in \mathbb{R}^{m \times m}$ and $F_i \in \mathbb{R}^{m \times m}$ are symmetric. Assume that $f(x)$ is finite.

(a) Propose a proper way of defining the Lagrange function, the dual function
and the dual problem that leads to the weak duality theorem.

(b) Employing the way proposed in part (a), derive the dual problem
of (2.183).

(c) Let $d^*$ be the optimal value of the dual problem derived in part (b). State
the weak duality theorem that shows the relationship between $p^*$ and $d^*$.
Also prove the theorem.


Prob 7.6 (Weak duality for SDP) Consider an optimization problem:

$$p^* := \min_x f(x) : \quad G + x_1 F_1 + \cdots + x_d F_d \succeq 0 \tag{2.184}$$

where $x := [x_1, \dots, x_d]^T \in \mathbb{R}^d$, and $G \in \mathbb{R}^{m \times m}$ and $F_i \in \mathbb{R}^{m \times m}$ are symmetric.
Assume that $f(x)$ is finite. Let $Z \in \mathbb{R}^{m \times m}$ be symmetric. Define the Lagrange
function as:

$$L(x, Z) := f(x) - \mathrm{trace}\left(Z (G + x_1 F_1 + \cdots + x_d F_d)\right). \tag{2.185}$$

(a) Derive the dual problem of (2.184).

(b) Prove that $p^* \ge d^*$.


Prob 7.7 (True or False?)

(a) Consider an optimization problem:

$$\min_x f(x) : \quad G + x_1 F_1 + \cdots + x_d F_d \succeq 0 \tag{2.186}$$

where $x \in \mathbb{R}^d$, and $G \in \mathbb{R}^{m \times m}$ and $F_i \in \mathbb{R}^{m \times m}$ are symmetric. Then, one
can derive the dual problem only when $f(x)$ is finite $\forall x$.


3.1 Supervised Learning and Optimization


Recap In Part II, we focused on the study of the two important theorems: 
(i) strong duality theorem; (ii) weak duality theorem. The strong duality theorem 
provided algorithmic insights, thus leading to the interior point method that can 
be applied to generic settings that we have investigated in Part I. The weak duality 
theorem helped us to approximate optimal solutions for non-convex optimization 
problems, which are intractable in general. 

At the end of Part II, we emphasized that what we have learned so far is
instrumental in addressing important issues that arise in a trending research field:
machine learning. The goal of Part III is to support this claim.


Outline In this section, we will start investigating the field of machine learning
and the role of optimization therein. We will do three things. First, we will study
what machine learning is and what the mission of the field is. We will then explore
one very popular and traditional way to achieve the mission: supervised learning.
Lastly, we will figure out how optimization techniques are related to supervised
learning.


Figure 3.1. Machine learning is the study of algorithms which provide a set of instructions
to a computer system (machine) so that it can perform a specific task of interest. The
algorithm, together with data, trains the machine. Let x be an input indicating information
employed to perform a task. Let y be an output to denote a task result.


Machine learning Machine learning is about an algorithm which is defined to 
be a set of instructions that a computer system can execute. Formally speaking, 
machine learning is the study of algorithms with which one can train a computer 
system so that it can perform a specific task of interest. See Fig. 3.1 for pictorial 
illustration. 

The entity that we are interested in building is a computer system, which is
a machine. Since it is a system (i.e., a function), it has an input and an output.
The input, usually denoted by x, indicates information employed to perform a task
of interest. The output, usually denoted by y, denotes a task result. For instance, if
the task is legitimate-vs-spam email filtering that we studied in Part I, then x could
be features (e.g., frequencies of some keywords that appear in an email), and y is
the email class, e.g., y = +1 indicates a legitimate email while y = −1 denotes a
spam email. Or if the task of interest is cat-vs-dog classification, x could be
image-pixel values and y is a binary value indicating whether the fed image is a cat
(say y = 1) or a dog (y = 0).

Machine learning is about designing algorithms whose main role is to train
the computer system so that it can perform the task well. In the process of designing
algorithms, we use what is called data.


A remark on the naming One can easily see the rationale behind the naming
by changing the viewpoint. From a machine's perspective, a machine learns the task
from data. Hence, it is called machine learning. The name was coined in 1959
by Arthur Lee Samuel (Samuel, 1967). See Fig. 3.2.

Arthur Samuel is one of the pioneers in Artificial Intelligence (AI), which includes
machine learning as a sub-field. The AI field is the study of creating intelligence by
machines, unlike the natural intelligence displayed by intelligent beings like humans
and animals.

Figure 3.2. Arthur Lee Samuel ('59) is an American pioneer in the field of artificial
intelligence. One of his prominent achievements in the early days was to develop computer
checkers, which later formed the basis of AlphaGo.


One of his achievements in early days is to develop a human-like computer player 
for a board game, called checkers; see the right figure in Fig. 3.2. He proposed many 
algorithms and ideas while developing computer checkers. Those algorithms could 
form the basis of AlphaGo (Silver et al., 2016), a computer program for the board 
game Go which defeated one of the 9-dan professional players, Lee Sedol, with 4 
wins out of 5 games in 2016 (News, 2016). 


Mission of machine learning The end-mission of machine learning is achieving 
artificial intelligence (AI). So it can be viewed as one methodology for AI. In light 
of the block diagram in Fig. 3.1, one can say that the goal of ML is to design an 
algorithm so that the trained machine behaves like intelligent beings. 


Supervised learning There are several methodologies which help us to achieve
the goal of ML. One specific yet very popular method is supervised learning.

Supervised learning means learning a function $f(\cdot)$ (indicating the function
of the machine) with the help of a supervisor. See Fig. 3.3.

The supervisor in this context is the one who provides input-output
paired samples. The input-output samples form the data employed for training
the machine, usually denoted by:

$$\{(x^{(i)}, y^{(i)})\}_{i=1}^m \tag{3.1}$$

where $(x^{(i)}, y^{(i)})$ indicates the ith input-output sample (also called a training sample
or an example) and m denotes the number of samples. Using the notation (3.1),
one can say that supervised learning is to:

Estimate $f(\cdot)$ using the training samples $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$. (3.2)

Figure 3.3. Supervised learning: A methodology for designing a computer system $f(\cdot)$
with the help of a supervisor which offers input-output pair samples, called a training
dataset $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$.


Optimization A common way to estimate $f(\cdot)$ is via optimization. This is exactly
how the optimization techniques that we learned are related to supervised learning.
In view of the goal (3.2) of supervised learning, what we want is:

$$y^{(i)} \approx f(x^{(i)}), \quad \forall i \in \{1, \dots, m\}.$$

Then, a natural question arises: how do we quantify closeness between the two
quantities $y^{(i)}$ and $f(x^{(i)})$? One very common way that has been used in the field is to
employ a function, called a loss function, usually denoted by:

$$\ell(y^{(i)}, f(x^{(i)})). \tag{3.3}$$

One obvious property that the loss function $\ell(\cdot, \cdot)$ should have is that it should
be small when the two arguments are close, and zero when the two are
identical. One prominent loss function that you saw earlier is the squared error
loss, introduced by the father of optimization, Gauss: $\|y^{(i)} - f(x^{(i)})\|^2$.

Using the loss function (3.3), one can then formulate an optimization problem
as follows:

$$\min_{f(\cdot)} \sum_{i=1}^m \ell(y^{(i)}, f(x^{(i)})). \tag{3.4}$$

In fact, this does not have the conventional optimization problem structure that we
are familiar with. In (3.4), the quantity that we optimize over is the function $f(\cdot)$
itself.

We have not seen this type of optimization, called function optimization. How do
we deal with it? There is one typical approach in the field: specify a function class
(e.g., linear or quadratic), represent the function with parameters (also called
weights), denoted by w, and then consider the weights as an optimization variable.
Taking this approach, one can translate the problem (3.4) into:

$$\min_w \sum_{i=1}^m \ell(y^{(i)}, f_w(x^{(i)})) \tag{3.5}$$

where $f_w(x^{(i)})$ denotes the function $f(x^{(i)})$, parameterized by w.

Figure 3.4. Frank Rosenblatt (1928-1971) is an American psychologist notable as the
inventor of Perceptron ('57). One sad story is that he died in 1971 on his 43rd birthday,
in a boating accident.

Now how is the optimization problem (3.5) related to the convex optimization that
we have learned about thus far? To see this, we need to check whether the objective
function $\ell(y^{(i)}, f_w(x^{(i)}))$ is convex in the optimization variable w. Obviously the
convexity depends on how we define the two functions: (i) $f_w(x^{(i)})$ w.r.t. w; and
(ii) the loss function $\ell(\cdot, \cdot)$. In machine learning, much work has been done on
the choice of these functions.
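To make the parameterized problem (3.5) concrete, the following minimal sketch evaluates the objective for one particular choice of the two functions. The linear model, the squared error loss and the toy data are all assumptions made for illustration; they are not the choices that the rest of this section settles on.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # one (x, y) training pair


def linear_model(w: Sequence[float], x: Sequence[float]) -> float:
    """f_w(x) = w^T x: one possible (assumed) function class."""
    return sum(wj * xj for wj, xj in zip(w, x))


def squared_error(y: float, y_hat: float) -> float:
    """Squared error loss: zero when the two arguments coincide."""
    return (y - y_hat) ** 2


def objective(w: Sequence[float], data: List[Sample],
              model: Callable = linear_model,
              loss: Callable = squared_error) -> float:
    """Sum over the samples of loss(y_i, f_w(x_i)), as in (3.5)."""
    return sum(loss(y, model(w, x)) for x, y in data)


# Toy data generated by y = 2*x1 - x2 (made up for illustration).
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
print(objective((2.0, -1.0), data))  # weights that fit the data exactly
print(objective((0.0, 0.0), data))   # all-zero weights fit poorly
```

Once the function class is fixed, the only unknown left is the weight vector w, which is what turns the function optimization (3.4) into an ordinary optimization over finitely many variables.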


Introduction of neural networks Around the same time that the ML field
was founded, one architecture was suggested for the first function $f_w(\cdot)$ in the
context of simple binary classifiers in which y takes one of two values only.
The architecture is called Perceptron, and was invented in 1957 by one of the
pioneers in the AI field, Frank Rosenblatt (Rosenblatt, 1958). See Fig. 3.4.
Interestingly, Frank Rosenblatt was a psychologist. He was interested in how the
brains of intelligent beings work, and his study of brains led him to come up with
Perceptron, which is inspired by the brain structure and therefore gave significant
insights into neural networks.


Figure 3.5. Neurons are electrically excitable cells and are connected through synapses.

Figure 3.6. The architecture of Perceptron.


How brains work Here are details on how the brain structure inspired the
architecture of Perceptron. Inside a brain, there are many electrically excitable cells,
called neurons; see Fig. 3.5. Here a red-circled one indicates a neuron, so the figure
shows three neurons in total. There are three major properties of neurons that led to
the architecture of Perceptron.

The first is that a neuron is an electrical quantity, so it has a voltage. The second 
property is that neurons are connected with each other through mediums, called 
synapses. So the main role of synapses is to deliver electrical voltage signals across 
neurons. Depending on the connectivity strength level of a synapse, a voltage signal 
from one neuron to another can increase or decrease. The last is that a neuron takes 
a particular action, called activation. Depending on its voltage level, it generates an 
all-or-nothing pulse signal. For instance, if its voltage level is above a certain thresh- 
old, then it generates an impulse signal with a certain magnitude, say 1; otherwise, 
it produces nothing. 


Perceptron The above three properties of neurons led Frank Rosenblatt to
propose the architecture of Perceptron, as illustrated in Fig. 3.6.

Let x be an n-dimensional real-valued signal: $x := [x_1, x_2, \dots, x_n]^T$. Suppose
each component $x_i$ is distributed to each neuron, and let us interpret $x_i$ as the voltage
level of the ith neuron. The voltage signal $x_i$ is then delivered through a synapse
to another neuron (placed on the right in the figure, indicated by a big circle).
Remember that the voltage level can increase or decrease depending on the
connectivity strength of a synapse. To capture this, a weight, say $w_i$, is multiplied to
$x_i$, so $w_i x_i$ is the delivered voltage signal at the terminal neuron. Based on another
observation that the voltage level at the terminal neuron increases with more
connected neurons, Rosenblatt introduced an adder which simply aggregates all the
voltage signals coming from many neurons, so he modeled the voltage signal at the
terminal neuron as:

$$w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = w^T x. \tag{3.6}$$

Lastly, in an effort to mimic the activation, he modeled the output signal as

$$f_w(x) = \begin{cases} 1 & \text{if } w^T x > \text{th}, \\ 0 & \text{o.w.} \end{cases} \tag{3.7}$$

where "th" indicates a certain threshold level. It can also be simply denoted as

$$f_w(x) = \mathbf{1}\{w^T x > \text{th}\}. \tag{3.8}$$
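The adder followed by the threshold can be sketched in a few lines. The weights and threshold below are hypothetical values, chosen so that the unit reproduces a logical AND of two binary inputs; they are not taken from the text.

```python
from typing import Sequence


def perceptron(w: Sequence[float], x: Sequence[float], th: float = 0.0) -> int:
    """Return 1 if w^T x > th and 0 otherwise, as in (3.8)."""
    voltage = sum(wi * xi for wi, xi in zip(w, x))  # adder: w^T x in (3.6)
    return 1 if voltage > th else 0                 # all-or-nothing activation


# Hypothetical weights and threshold implementing a logical AND.
w_and, th_and = (1.0, 1.0), 1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(w_and, x, th_and))
```

Only the input (1, 1) pushes the aggregated voltage above the threshold, so it is the only input for which the unit fires.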


Activation functions Taking the Perceptron architecture in Fig. 3.6, one can
formulate the optimization problem (3.5) as:

$$\min_w \sum_{i=1}^m \ell\left(y^{(i)}, \mathbf{1}\{w^T x^{(i)} > \text{th}\}\right). \tag{3.9}$$

This is the initial optimization problem that people developed, inspired by
Perceptron. However, people immediately figured out that there is an issue in solving
this optimization. The issue comes from the fact that the objective function contains
an indicator function, so it is not differentiable. Why is the non-differentiability
problematic? Remember the algorithms that we learned in the past: gradient descent
and the interior point method. All of them involve derivative operations. So the
non-differentiability of the objective function does not allow us to use these
algorithms.
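A small numerical experiment illustrates the difficulty. The sketch below assumes the squared error loss and a made-up dataset, and estimates the gradient of the objective (3.9) by central differences: away from the jumps the indicator is flat, so the estimated gradient is exactly zero and gives gradient descent no direction to move in.

```python
def step_objective(w, data, th=0.0):
    """Sum of squared errors between labels and the hard-threshold output (3.9)."""
    total = 0.0
    for x, y in data:
        y_hat = 1.0 if sum(wi * xi for wi, xi in zip(w, x)) > th else 0.0
        total += (y - y_hat) ** 2
    return total


def finite_difference_gradient(f, w, eps=1e-6):
    """Central-difference estimate of the gradient of f at w."""
    grad = []
    for j in range(len(w)):
        up = list(w)
        up[j] += eps
        down = list(w)
        down[j] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


# Made-up samples; the first one is misclassified by the weights below.
data = [((1.0, 2.0), 1.0), ((-1.0, 0.5), 0.0), ((2.0, -1.0), 1.0)]
w = [0.3, -0.2]
print(step_objective(w, data))  # one sample is wrong, so the loss is nonzero
print(finite_difference_gradient(lambda v: step_objective(v, data), w))
```

The objective is nonzero, so the weights are not optimal, yet the gradient estimate carries no information about how to improve them.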

What can we do? One typical way that people have taken in the field is to
approximate the activation function. There are many ways to approximate it. Below,
we investigate one popular approach.


Logistic regression The popular approximation approach is to take a
smooth transition from 0 to 1 in place of the abrupt indicator function:

$$f_w(x) = \frac{1}{1 + e^{-w^T x}}. \tag{3.10}$$

Figure 3.7. Logistic function: $\sigma(z) = \frac{1}{1 + e^{-z}}$.

Notice that $f_w(x) \approx 0$ when $w^T x$ is very small (i.e., very negative); it then grows
exponentially with an increase in $w^T x$; later grows logarithmically; and finally
saturates at 1 when $w^T x$ is very large. See Fig. 3.7. Actually the function (3.10) is a
very popular one used in statistics, called the logistic¹ function (Garnier and Quetelet,
1838). There is another name: the sigmoid² function.

There are two good things about the logistic function. First it is differentiable. 
Second, it can play a role as the probability for the output in the binary classifier, 
e.g., P(y = 1) where y denotes the ground-truth label in the binary classifier. So it 
is very much interpretable. 
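Both properties are easy to check numerically. The sketch below is one possible implementation, written in a form that avoids overflow for large |z|; the function names are ours. It evaluates the logistic function together with its derivative, which equals σ(z)(1 − σ(z)).

```python
import math


def logistic(z: float) -> float:
    """Logistic (sigmoid) function 1 / (1 + e^{-z}), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # for z < 0 use the equivalent form e^z / (1 + e^z)
    return ez / (1.0 + ez)


def logistic_derivative(z: float) -> float:
    """Derivative sigma'(z) = sigma(z) * (1 - sigma(z)); it exists for every z."""
    s = logistic(z)
    return s * (1.0 - s)


for z in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(z, logistic(z), logistic_derivative(z))
```

The output always lies between 0 and 1, and σ(z) + σ(−z) = 1, which is what allows the output to be read as the probability of one class and its complement as the probability of the other.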

For this function, people came up with a loss function, which turns out to be
optimal in some sense and is expressed as:

$$\ell(y, \hat{y}) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y}). \tag{3.11}$$

This function is called cross entropy loss (Cover, 1999) and the rationale behind
the naming will be explained later. Taking the logistic function together with cross
entropy loss, we can then formulate the problem (3.5) as:

$$\min_w \sum_{i=1}^m -y^{(i)} \log \frac{1}{1 + e^{-w^T x^{(i)}}} - (1 - y^{(i)}) \log \frac{e^{-w^T x^{(i)}}}{1 + e^{-w^T x^{(i)}}}. \tag{3.12}$$

It turns out that this optimization problem is convex and provides the optimal
performance in some sense. The name of this classifier is logistic regression.
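The objective (3.12) can be evaluated directly from the weights. The sketch below is a minimal version with made-up data; it uses the identities −log ŷ = log(1 + e^{−z}) and −log(1 − ŷ) = log(1 + e^{z}) with z = wᵀx, so that no logarithm of a number close to zero is ever taken. The helper name `log1pexp` is ours.

```python
import math
from typing import List, Sequence, Tuple


def log1pexp(z: float) -> float:
    """Stable evaluation of log(1 + e^z)."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def cross_entropy_objective(w: Sequence[float],
                            data: List[Tuple[Sequence[float], int]]) -> float:
    """Objective (3.12): sum of -y log(y_hat) - (1 - y) log(1 - y_hat)."""
    total = 0.0
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x))  # w^T x
        # -log(y_hat) = log(1 + e^{-z}) and -log(1 - y_hat) = log(1 + e^{z})
        total += y * log1pexp(-z) + (1 - y) * log1pexp(z)
    return total


# Made-up binary-labelled samples.
data = [((1.0, 2.0), 1), ((-1.0, 0.5), 0), ((2.0, -1.0), 1)]
print(cross_entropy_objective((0.0, 0.0), data))   # every y_hat is 1/2
print(cross_entropy_objective((1.0, 0.0), data))   # a better fit, smaller loss
```

With all-zero weights every prediction is 1/2, so each sample contributes log 2 to the objective regardless of its label.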


Look ahead In the next section, we will show that logistic regression is indeed a 
convex classifier. We will also study in what sense logistic regression is optimal and 
will prove the optimality. 


1. The word logistic comes from a Greek word which means a slow growth, like a logarithmic growth.


2. Sigmoid means resembling the lower-case Greek letter sigma, S-shaped.


3.2 Logistic Regression


Recap In the last section, we embarked on Part III, which focuses on machine
learning applications. We emphasized that optimization techniques play a
significant role in the field. Specifically we focused on one machine learning
methodology: supervised learning, which serves as a powerful and classical
methodology in achieving the goal of ML. Recall that the goal of supervised learning
is to estimate a function $f_w(\cdot)$ of a machine using input-output samples
$\{(x^{(i)}, y^{(i)})\}_{i=1}^m$, where m denotes the number of training samples. Here w indicates
a collection of parameters which serve to implement the function $f_w(\cdot)$. For the
function $f_w(\cdot)$, we studied one specific yet historical architecture, inspired by the
brain's neural networks: Perceptron. It first takes a linear operation with an input,
say x, to compute $w^T x$. It then passes it to an activation function to yield an output,
say $\hat{y} := f_w(x)$.

Next, we formulated an optimization problem accordingly:

$$\min_w \sum_{i=1}^m \ell(y^{(i)}, f_w(x^{(i)})) \tag{3.13}$$

where $\ell(y^{(i)}, f_w(x^{(i)}))$ indicates a loss function which quantifies closeness between
the ground-truth label $y^{(i)}$ and its estimate $f_w(x^{(i)})$. Since the stair-shaped activation
function that Frank Rosenblatt introduced initially is not differentiable and thus
inapplicable to algorithms like gradient descent, we approximated the activation via
the logistic function $f_w(x) = \frac{1}{1 + e^{-w^T x}}$. Taking this together with a loss function,
called cross entropy loss,

$$\ell_{\mathrm{CE}}(y, \hat{y}) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y}), \tag{3.14}$$

we obtained a predictor, named logistic regression. We then claimed that logistic
regression is formulated as a convex optimization problem. We also claimed that
the predictor based on cross entropy loss (3.14) yields the optimal performance in
some sense.


Outline In this section, we will prove these claims. We will do three things.
First, we will show that logistic regression is indeed a convex predictor.
Next, we will investigate in what sense the predictor is optimal. We will then prove
the optimality. In addition, we will discuss how to solve logistic regression.


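Objective function in logistic regression Taking the logistic function and
cross entropy loss (3.14), we obtain:

$$\begin{aligned}
\min_w \sum_{i=1}^m \ell_{\mathrm{CE}}(y^{(i)}, f_w(x^{(i)}))
&= \min_w \sum_{i=1}^m \ell_{\mathrm{CE}}\left(y^{(i)}, \frac{1}{1 + e^{-w^T x^{(i)}}}\right) \\
&= \min_w \sum_{i=1}^m -y^{(i)} \log \frac{1}{1 + e^{-w^T x^{(i)}}} - (1 - y^{(i)}) \log \frac{e^{-w^T x^{(i)}}}{1 + e^{-w^T x^{(i)}}}.
\end{aligned} \tag{3.15}$$
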
Proof of convexity Let us first prove that the optimization problem (3.15) is
indeed convex. Since convexity is preserved under addition (and under scaling by the
non-negative factors $y^{(i)}$ and $1 - y^{(i)}$), it suffices to prove:

(i) $-\log \frac{1}{1 + e^{-w^T x}}$ is convex in w;

(ii) $-\log \frac{e^{-w^T x}}{1 + e^{-w^T x}}$ is convex in w.

Since the 2nd function can be represented as the sum of an affine function and the
1st function:

$$-\log \frac{e^{-w^T x}}{1 + e^{-w^T x}} = w^T x - \log \frac{1}{1 + e^{-w^T x}},$$

it suffices to prove the convexity of the 1st function.
Notice that the 1st function can be rewritten as:

$$-\log \frac{1}{1 + e^{-w^T x}} = \log(1 + e^{-w^T x}). \tag{3.16}$$

Taking a derivative of the RHS formula in (3.16) w.r.t. w, we get:

$$\nabla_w \log(1 + e^{-w^T x}) = \frac{-x e^{-w^T x}}{1 + e^{-w^T x}}.$$

This is due to the chain rule. Taking another derivative of the above,
we obtain the Hessian as follows:

$$\begin{aligned}
\nabla_w^2 \log(1 + e^{-w^T x}) &= \nabla_w \left( \frac{-x e^{-w^T x}}{1 + e^{-w^T x}} \right) \\
&\overset{(a)}{=} \frac{x x^T e^{-w^T x} (1 + e^{-w^T x}) - x x^T e^{-w^T x} e^{-w^T x}}{(1 + e^{-w^T x})^2} \\
&= \frac{x x^T e^{-w^T x}}{(1 + e^{-w^T x})^2} \overset{(b)}{\succeq} 0
\end{aligned} \tag{3.17}$$

where (a) is due to the derivative rule of a quotient of two functions and (b) follows
from the fact that the eigenvalues of $x x^T$ are $x^T x \ge 0$ and 0 (why?). As the Hessian
is PSD, we prove the convexity.

Figure 3.8. Logistic regression.
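The conclusion can be sanity-checked numerically. The following sketch is not a proof: on randomly drawn points (the ranges and the seed are arbitrary choices) it tests midpoint convexity of g(w) = log(1 + e^{−wᵀx}) and non-negativity of the quadratic form vᵀHv, where H is the Hessian in (3.17).

```python
import math
import random


def g(w, x):
    """g(w) = log(1 + e^{-w^T x}), the function rewritten in (3.16)."""
    z = -sum(wi * xi for wi, xi in zip(w, x))
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))  # stable log(1 + e^z)


def hessian_quadratic_form(w, x, v):
    """v^T H v with H = x x^T e^{-w^T x} / (1 + e^{-w^T x})^2 from (3.17)."""
    u = sum(wi * xi for wi, xi in zip(w, x))
    s = 1.0 / (1.0 + math.exp(-u))             # logistic(w^T x)
    xv = sum(xi * vi for xi, vi in zip(x, v))
    return s * (1.0 - s) * xv * xv             # scalar factor times (x^T v)^2


random.seed(0)
violations = 0
for _ in range(1000):
    x, w1, w2, v = ([random.uniform(-3, 3) for _ in range(4)] for _ in range(4))
    mid = [(a + b) / 2 for a, b in zip(w1, w2)]
    if g(mid, x) > (g(w1, x) + g(w2, x)) / 2 + 1e-12:   # midpoint convexity
        violations += 1
    if hessian_quadratic_form(w1, x, v) < 0:            # PSD check
        violations += 1
print(violations)
```

The quadratic form is a non-negative scalar times (xᵀv)², which is the computational counterpart of step (b) in (3.17).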


In what sense logistic regression is optimal? Notice that the range of the
output $\hat{y}$ is in between 0 and 1:

$$0 \le \hat{y} \le 1.$$

Hence, one can interpret this as a probability quantity. The optimality of a
predictor can be defined under the following assumption inspired by the probabilistic
interpretation:

$$\text{Assumption}: \quad \hat{y} = P(y = 1 \mid x). \tag{3.18}$$

To understand what it means in detail, consider the likelihood of the
ground-truth predictor:

$$P\left(\{y^{(i)}\}_{i=1}^m \mid \{x^{(i)}\}_{i=1}^m\right). \tag{3.19}$$

Notice that the output $\hat{y}$ is a function of the weights w. Hence, assuming (3.18),
the likelihood (3.19) is also a function of w.

We are now ready to define the optimal w. The optimal weight, say $w^*$, is defined
as the one that maximizes the likelihood (3.19):

$$w^* := \arg\max_w P\left(\{y^{(i)}\}_{i=1}^m \mid \{x^{(i)}\}_{i=1}^m\right). \tag{3.20}$$

Of course, there are other ways to define the optimality. Here, we employ the
maximum likelihood principle, the most popular choice. This is exactly where the
definition of the optimal loss function, say $\ell^*(\cdot, \cdot)$, kicks in. We say that $\ell^*(\cdot, \cdot)$ is
defined as the one that satisfies:

$$\arg\min_w \sum_{i=1}^m \ell^*(y^{(i)}, \hat{y}^{(i)}) = \arg\max_w P\left(\{y^{(i)}\}_{i=1}^m \mid \{x^{(i)}\}_{i=1}^m\right). \tag{3.21}$$

It turns out that the condition (3.21) gives us the optimal loss function $\ell^*(\cdot, \cdot)$
that yields a well-known classifier: logistic regression, in which the loss function
reads:

$$\ell^*(y, \hat{y}) = \ell_{\mathrm{CE}}(y, \hat{y}) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y}). \tag{3.22}$$

Now let us prove (3.22).
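

Proof of $\ell^*(\cdot, \cdot) = \ell_{\mathrm{CE}}(\cdot, \cdot)$ Usually samples are obtained from different data $x^{(i)}$'s.
Hence, it is reasonable to assume that such samples are independent of each
other:

$$\{(x^{(i)}, y^{(i)})\}_{i=1}^m \text{ are independent over } i. \tag{3.23}$$

Under this assumption, we can rewrite the likelihood (3.19) as:

$$\begin{aligned}
P\left(\{y^{(i)}\}_{i=1}^m \mid \{x^{(i)}\}_{i=1}^m\right)
&\overset{(a)}{=} \frac{P\left(\{(x^{(i)}, y^{(i)})\}_{i=1}^m\right)}{P\left(\{x^{(i)}\}_{i=1}^m\right)} \\
&\overset{(b)}{=} \frac{\prod_{i=1}^m P(x^{(i)}, y^{(i)})}{\prod_{i=1}^m P(x^{(i)})} \\
&\overset{(c)}{=} \prod_{i=1}^m P(y^{(i)} \mid x^{(i)})
\end{aligned} \tag{3.24}$$

where (a) and (c) are due to the definition of conditional probability; and (b)
comes from the independence assumption (3.23). Here $P(x^{(i)}, y^{(i)})$ denotes the
probability distribution of the input-output pair of the system:

$$P(x^{(i)}, y^{(i)}) := P(X = x^{(i)}, Y = y^{(i)}) \tag{3.25}$$

where X and Y indicate random variables of the input and the output, respectively.
Recall the probability-interpretation-related assumption (3.18) made with
regard to $\hat{y}$:

$$\hat{y} = P(y = 1 \mid x).$$

This implies that:

$$y = 1: \; P(y \mid x) = \hat{y}; \qquad y = 0: \; P(y \mid x) = 1 - \hat{y}.$$

Hence, one can represent $P(y \mid x)$ as:

$$P(y \mid x) = \hat{y}^{y} (1 - \hat{y})^{1 - y}.$$

Now using the notations of $(x^{(i)}, y^{(i)})$ and $\hat{y}^{(i)}$, we obtain:

$$P(y^{(i)} \mid x^{(i)}) = (\hat{y}^{(i)})^{y^{(i)}} (1 - \hat{y}^{(i)})^{1 - y^{(i)}}.$$

Plugging this into (3.24), we get:

$$P\left(\{y^{(i)}\}_{i=1}^m \mid \{x^{(i)}\}_{i=1}^m\right) = \prod_{i=1}^m P(y^{(i)} \mid x^{(i)}) = \prod_{i=1}^m (\hat{y}^{(i)})^{y^{(i)}} (1 - \hat{y}^{(i)})^{1 - y^{(i)}}. \tag{3.26}$$

This together with (3.20) yields:

$$\begin{aligned}
w^* &:= \arg\max_w \prod_{i=1}^m (\hat{y}^{(i)})^{y^{(i)}} (1 - \hat{y}^{(i)})^{1 - y^{(i)}} \\
&\overset{(a)}{=} \arg\max_w \sum_{i=1}^m y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \\
&\overset{(b)}{=} \arg\min_w \sum_{i=1}^m -y^{(i)} \log \hat{y}^{(i)} - (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})
\end{aligned} \tag{3.27}$$

where (a) comes from the fact that $\log(\cdot)$ is an increasing function and
$\prod_{i=1}^m (\hat{y}^{(i)})^{y^{(i)}} (1 - \hat{y}^{(i)})^{1 - y^{(i)}}$ is positive; and (b) is due to changing the sign of the
objective while replacing max with min.

Note that the term inside the summation in the last equality in (3.27) respects
the formula of cross entropy loss:

$$\ell_{\mathrm{CE}}(y, \hat{y}) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y}). \tag{3.28}$$

Hence, the optimal loss function that yields the maximum likelihood solution is:

$$\ell^*(\cdot, \cdot) = \ell_{\mathrm{CE}}(\cdot, \cdot).$$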
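The chain of equalities (3.26)-(3.27) can be checked on numbers. In the sketch below the labels and the two sets of predicted probabilities are made up; the point is only that the negative logarithm of the likelihood equals the sum of cross entropy losses, so the predictor with the larger likelihood is the one with the smaller total loss.

```python
import math


def likelihood(labels, probs):
    """Product of y_hat^y * (1 - y_hat)^(1 - y) over the samples, as in (3.26)."""
    result = 1.0
    for y, p in zip(labels, probs):
        result *= p if y == 1 else 1.0 - p
    return result


def cross_entropy_sum(labels, probs):
    """Sum of -y log(y_hat) - (1 - y) log(1 - y_hat), the last line of (3.27)."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total


labels = [1, 0, 1, 1]
good = [0.9, 0.2, 0.8, 0.7]   # hypothetical outputs of a well-fitted predictor
bad = [0.6, 0.5, 0.4, 0.3]    # hypothetical outputs of a poorly fitted predictor

for probs in (good, bad):
    print(likelihood(probs=probs, labels=labels),
          -math.log(likelihood(labels, probs)),
          cross_entropy_sum(labels, probs))
```

For each predictor the last two printed numbers agree up to rounding, which is step (a) and (b) of (3.27) applied to concrete values.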


Remarks on cross entropy loss (3.28) Before moving on to the next topic
about how to solve the optimization problem (3.15), let us say a few words about
why the loss function (3.28) is called cross entropy loss. This naming comes from
the definition of cross entropy, which is a measure used in the field of information
theory. The cross entropy is defined w.r.t. two random variables. For simplicity, let
us consider two binary random variables, say $X \sim \mathrm{Bern}(p)$ and $Y \sim \mathrm{Bern}(q)$,
where $X \sim \mathrm{Bern}(p)$ indicates a binary random variable with $p = P(X = 1)$. For
such two random variables, the cross entropy is defined as (Cover, 1999):

$$H(p, q) := -p \log q - (1 - p) \log(1 - q). \tag{3.29}$$

Notice that the formula (3.29) is exactly the same as the term inside the summation
in (3.27), except for the notation. Hence, the loss is called cross entropy loss.

Here you may wonder why $H(p, q)$ in (3.29) is called cross entropy. The rationale
comes from the following fact:

$$H(p, q) \ge H(p) := -p \log p - (1 - p) \log(1 - p) \tag{3.30}$$

where $H(p)$ is a very well-known quantity in information theory, named entropy or
Shannon entropy (Shannon, 2001). One can actually prove the inequality in (3.30)
using Jensen's inequality. Also one can verify that the equality holds when $p = q$.
We will not prove this here. But you will have a chance to check this in Prob 8.3(b).
So from this, one can interpret $H(p, q)$ as an entropic measure of discrepancy across
distributions. Hence, it is called cross entropy.
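The inequality (3.30) and its equality condition can be observed on a grid of values. The sketch below is a numerical check rather than a proof; the grid spacing is an arbitrary choice.

```python
import math


def cross_entropy(p: float, q: float) -> float:
    """H(p, q) = -p log q - (1 - p) log(1 - q), defined for 0 < q < 1."""
    return -p * math.log(q) - (1.0 - p) * math.log(1.0 - q)


def entropy(p: float) -> float:
    """Shannon entropy H(p) of Bern(p), i.e. the cross entropy of p with itself."""
    return cross_entropy(p, p)


grid = [k / 20 for k in range(1, 20)]  # 0.05, 0.10, ..., 0.95
smallest_gap = min(cross_entropy(p, q) - entropy(p) for p in grid for q in grid)
print(smallest_gap)                                 # attained on the diagonal p = q
print(cross_entropy(0.3, 0.6) - entropy(0.3))       # a strictly positive gap
```

The gap H(p, q) − H(p) is never negative on the grid, and it vanishes only when q coincides with p.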


How to solve logistic regression (3.15)? Let $J(w)$ be the objective function
normalized by the number m of samples:

$$J(w) := \frac{1}{m} \sum_{i=1}^m -y^{(i)} \log \hat{y}^{(i)} - (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}). \tag{3.31}$$

Remember that at the beginning of this section we proved the convexity of the
objective function. Also the optimization problem (3.15) is unconstrained. Hence,
we can use gradient descent to find the optimal solution. Specifically one can take
the following update rule:

$$w^{(t+1)} \leftarrow w^{(t)} - \alpha^{(t)} \nabla J(w^{(t)})$$

where $w^{(t)}$ denotes the estimate of the weights at the t-th iteration, and $\alpha^{(t)}$ indicates
the learning rate. The gradient of the normalized objective function can be
computed as:

$$\nabla J(w) = \frac{1}{m} \sum_{i=1}^m -y^{(i)} \frac{\nabla \hat{y}^{(i)}}{\hat{y}^{(i)}} + (1 - y^{(i)}) \frac{\nabla \hat{y}^{(i)}}{1 - \hat{y}^{(i)}}. \tag{3.32}$$

Here the gradient of $\hat{y}^{(i)}$ can be expressed as:

$$\nabla \hat{y}^{(i)} = \frac{x^{(i)} e^{-w^T x^{(i)}}}{(1 + e^{-w^T x^{(i)}})^2} = x^{(i)} \hat{y}^{(i)} (1 - \hat{y}^{(i)}).$$

Plugging this into (3.32), we get:

$$\begin{aligned}
\nabla J(w) &= \frac{1}{m} \sum_{i=1}^m x^{(i)} \left\{ -y^{(i)} (1 - \hat{y}^{(i)}) + (1 - y^{(i)}) \hat{y}^{(i)} \right\} \\
&= \frac{1}{m} \sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)}) x^{(i)}.
\end{aligned}$$

Notice that $\nabla J(w)$ is simple to calculate.
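The update rule and the closed-form gradient translate into a short program. The sketch below is one way to do it, under several assumptions of ours: a constant learning rate, a fixed number of iterations, initialization at zero, and a made-up dataset whose first feature is the constant 1 (playing the role of a bias) and whose labels are not linearly separable, so that a finite minimizer exists.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]


def logistic(z: float) -> float:
    """Numerically stable 1 / (1 + e^{-z})."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def gradient(w: Sequence[float], data: List[Sample]) -> List[float]:
    """Gradient of J(w): (1/m) * sum_i (y_hat_i - y_i) * x_i."""
    grad = [0.0] * len(w)
    for x, y in data:
        y_hat = logistic(sum(wj * xj for wj, xj in zip(w, x)))
        for j, xj in enumerate(x):
            grad[j] += (y_hat - y) * xj / len(data)
    return grad


def gradient_descent(data: List[Sample], lr: float = 0.5,
                     steps: int = 2000) -> List[float]:
    """Iterate w <- w - lr * grad J(w), starting from w = 0."""
    w = [0.0] * len(data[0][0])
    for _ in range(steps):
        w = [wj - lr * gj for wj, gj in zip(w, gradient(w, data))]
    return w


# Made-up samples: (1, feature) with a binary label; not linearly separable.
data = [((1.0, -2.0), 0), ((1.0, -1.0), 0), ((1.0, -0.5), 1),
        ((1.0, 0.0), 0), ((1.0, 0.5), 1), ((1.0, 2.0), 1)]
w_star = gradient_descent(data)
print(w_star)
print(gradient(w_star, data))  # close to the zero vector at the minimizer
```

Because J(w) is convex and differentiable, a vanishing gradient certifies that the returned weights are (numerically) optimal for this toy problem.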


AI boomed in the 1960s but ... As we verified above, logistic regression is the
best binary classifier in the sense of maximizing the likelihood of training data,
assuming that the overall architecture is based on Perceptron. But as you can see, the
Perceptron architecture is somewhat restricted. So you may guess that the performance
of logistic regression based on the restricted architecture may not be that good in
many applications. It turns out that this is the case. Actually it was already verified in
1969 by two pioneers in the AI field, Marvin Minsky and Seymour Papert;
see Fig. 3.9 for their portraits.


Figure 3.9. Two AI pioneers, Marvin Minsky and Seymour Papert ('69), demonstrated
limitations of the architecture of Perceptron in a book titled "Perceptrons". Unfortunately,
this led to a very long depression period in the AI field, called the AI winter.


Figure 3.10. A giant in the AI field, Geoffrey Hinton, together with his PhD students,
Alex Krizhevsky and Ilya Sutskever, developed a deep neural network, named AlexNet,
intended for image classification (Krizhevsky et al., 2012). AlexNet achieved almost
human-level recognition performance, which had never been attained before. This won them
the ImageNet competition in 2012. More importantly, this recognition anchored the start
of the deep learning revolution.


They published a book, titled "Perceptrons" (Minsky and Papert, 1969), where
they demonstrated that the Perceptron architecture cannot implement even very
simple functions, like XOR. Their results disappointed many people at that time.
This finally led to a very long depression period of the AI field, called the
AI winter.
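The XOR limitation can be illustrated with a brute-force search. The sketch below tries every combination of two weights and a threshold on a coarse grid (the grid is an arbitrary choice of ours) for a single thresholded unit of the form used in this chapter. A finite search is an illustration rather than a proof, but the negative result does hold in general: XOR would require w₁ > th, w₂ > th and th ≥ 0 while w₁ + w₂ ≤ th, which is contradictory.

```python
from itertools import product


def perceptron(w1, w2, th, x1, x2):
    """Single thresholded unit: 1 if w1*x1 + w2*x2 > th, else 0."""
    return 1 if w1 * x1 + w2 * x2 > th else 0


def implements(table, w1, w2, th):
    """True if the unit reproduces every row of the truth table."""
    return all(perceptron(w1, w2, th, x1, x2) == y
               for (x1, x2), y in table.items())


AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

grid = [k / 4 for k in range(-20, 21)]  # -5.0, -4.75, ..., 5.0


def count_solutions(table):
    """Number of grid points (w1, w2, th) that implement the truth table."""
    return sum(implements(table, w1, w2, th)
               for w1, w2, th in product(grid, repeat=3))


print(count_solutions(AND))  # many parameter choices work for AND
print(count_solutions(XOR))  # none works for XOR
```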


AI revived in 2012 The AI winter continued until recently. However, a
big event happened in 2012, enabling the AI field to revive. The big event was
the winning of the ImageNet recognition competition by the following three people:
Geoffrey Hinton (a very well-known figure in the AI field, known as the Godfather
of deep learning) and his two PhD students (Alex Krizhevsky and Ilya Sutskever);
see Fig. 3.10.

They built a Perceptron-like neural network, but one which consists of many layers,
called a deep neural network (DNN) (Krizhevsky et al., 2012). They then showed
that their DNN, which they named AlexNet, could achieve almost human-level
recognition performance, which was astonishing at the time. This enabled the deep
learning revolution in the AI field.


Look ahead In the next few sections, we will investigate deep neural networks
(DNNs) and the role of optimization in that context. Specifically we will
cover the following topics. First we will study the architecture of DNNs. We
will then formulate a corresponding optimization problem. Next, we will explore
an efficient way of solving the problem. We will also briefly discuss why DNNs
offer great performance. Lastly, we will investigate how to implement DNNs using
one of the most popular deep learning software frameworks, TensorFlow.


Recap During the past two sections, we have studied one application of
optimization: supervised learning. The goal of supervised learning is to learn a
function of a machine using training samples. We considered a specific yet
brain-inspired architecture for the function structure: Perceptron; see Fig. 3.11. Taking
the logistic function together with cross entropy loss, we obtained logistic regression.
We then proved that logistic regression is optimal in the sense of maximizing the
likelihood of training samples.

At the end of the last section, we briefly mentioned how the AI field has
evolved with research on neural networks. While one of the first neural networks,
Perceptron, enabled the AI revolution in the 1960s, the boom ended shortly after
the publication of a book by Minsky & Papert, titled "Perceptrons". The book criticized
limitations of the Perceptron architecture, and this froze the passion of many
people working in the AI field, thus leading to the AI winter.

The AI field boomed again in 2012 thanks to Hinton and his group members. They
developed a neural network (which they called AlexNet) that demonstrated almost
human-level performance in image recognition, thus making people excited about the
field again. AlexNet is based on a deep neural network architecture (DNN for short).


Outline In this section, we are going to start investigating deep neural networks
(DNNs). Specifically we will cover the following four topics. First, we will study
what DNNs mean and the network architecture. We will then investigate how
DNNs were proposed, i.e., who the inventor was, as well as what the motivation
was. Next, we will discuss why DNNs were appreciated only recently, in other words,
why they were not appreciated during the AI winter. Finally we will formulate a
corresponding optimization problem to start talking about the connection to
optimization, this book's focus.


Figure 3.11. Input and output layers in Perceptron. There is no layer between the input
and output layers.


Terminologies Recall the Perceptron architecture in Fig. 3.11. Before defining
the deep neural network (DNN), we need to introduce a couple of terms.
The first is the input layer: the collection of neurons which take the input x.
Similarly, the output layer is defined as the collection of the output
neuron(s). A shallow neural network is defined as a network which consists of only
input and output layers, i.e., there is no intermediate layer between the two, like
Perceptron in Fig. 3.11.


Definition of deep neural networks (DNNs) We say that a neural network is
deep if it has at least one intermediate layer between the input and output layers. Such
an intermediate layer is called a hidden layer. So a deep neural network is defined
as a network which contains hidden layer(s).


Two-layer DNN architecture Here are details on what the DNN looks like. For
illustrative purposes, let us explain the architecture in a simple setting in which
there is only one hidden layer, called a 2-layer neural network in the field (see footnote 3); also
see Fig. 3.12.


input layer hidden layer output layer 


Figure 3.12. The operation at a neuron in a hidden layer is exactly the same as that in 
Perceptron. 


3. Someone may argue that this is a 3-layer neural network, as it has input/hidden/output layers. But the
convention in the deep learning field, adopted by many of the pioneers in the field, is that the number of layers
is counted as the total number of layers minus 1. The reason is that the input layer is not counted in
computing the total number, as it does not have an activation function. Personally, this way of defining a network
is confusing. Nonetheless, we will adopt this convention, as it is already so widely used.

Let us consider the operation at the first neuron in the hidden layer. The operation
is exactly the same as the operation that we saw earlier in Perceptron. First it takes
a linear combination to yield:

$$z_1^{[1]} = w_{11}^{[1]} x_1 + w_{12}^{[1]} x_2 + \cdots + w_{1n}^{[1]} x_n, \tag{3.33}$$


where $w_{1j}^{[1]}$ indicates the weight associated with $x_j$ and the 1st neuron in the (1st)
hidden layer. Here the superscript $(\cdot)^{[1]}$ denotes the index of hidden layers. In
general, a bias term, say $b_1^{[1]}$, can be added into (3.33). But for illustrative simplicity,
we will drop all the bias terms throughout. Next the linear combination is passed
onto an activation function, so we get:

$$a_1^{[1]} = \sigma^{[1]}\big(z_1^{[1]}\big) \tag{3.34}$$

where $\sigma^{[1]}(\cdot)$ indicates the activation function employed in the 1st hidden layer.
In general, different activation functions may be used across different layers,
while by convention the same activation function applies within a layer.
Applying the same operation to the other neurons in the hidden layer, we obtain a
picture like the one in Fig. 3.13.

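For notational simplicity, we introduce a matrix, say $W^{[1]}$, which aggregates all
the weights associated with the input layer and the hidden layer:

$$W^{[1]} := \begin{bmatrix} w_{11}^{[1]} & w_{12}^{[1]} & \cdots & w_{1n}^{[1]} \\ w_{21}^{[1]} & w_{22}^{[1]} & \cdots & w_{2n}^{[1]} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n^{[1]}1}^{[1]} & w_{n^{[1]}2}^{[1]} & \cdots & w_{n^{[1]}n}^{[1]} \end{bmatrix} \in \mathbb{R}^{n^{[1]} \times n} \tag{3.35}$$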

[Figure 3.13 diagram: input layer, hidden layer, output layer.]

Figure 3.13. The architecture of a hidden layer in a 2-layer DNN.

[Figure 3.14 diagram: input layer, hidden layer, output layer.]

Figure 3.14. The architecture of a 2-layer DNN.


In (3.35), $n^{[1]}$ denotes the number of neurons in the (1st) hidden layer. Using this
matrix notation, we can then represent the output of the hidden layer as:

$$a^{[1]} = \sigma^{[1]}\big(W^{[1]} x\big) \in \mathbb{R}^{n^{[1]}} \tag{3.36}$$


where $\sigma^{[1]}(\cdot)$ is applied component-wise.
Applying the same operation to the connection between the hidden and output layers,
we obtain a picture like the one in Fig. 3.14. Using another matrix, say
$W^{[2]} \in \mathbb{R}^{1 \times n^{[1]}}$, we can represent the output of the output layer as:

$$\hat{y} = \sigma^{[2]}\big(W^{[2]} a^{[1]}\big) \in \mathbb{R} \tag{3.37}$$

where $\sigma^{[2]}(\cdot)$ indicates the activation function at the output layer, which can possibly
be different from $\sigma^{[1]}(\cdot)$, as mentioned earlier.


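General DNN architecture In general, an (L + 1)-layer DNN (an L-hidden-layer
DNN) can be expressed as:

$$\begin{aligned} a^{[1]} &= \sigma^{[1]}\big(W^{[1]} x\big) \in \mathbb{R}^{n^{[1]}}, \\ a^{[2]} &= \sigma^{[2]}\big(W^{[2]} a^{[1]}\big) \in \mathbb{R}^{n^{[2]}}, \\ &\;\;\vdots \\ a^{[L]} &= \sigma^{[L]}\big(W^{[L]} a^{[L-1]}\big) \in \mathbb{R}^{n^{[L]}}, \\ \hat{y} &= \sigma^{[L+1]}\big(W^{[L+1]} a^{[L]}\big) \in \mathbb{R} \end{aligned} \tag{3.38}$$

where $a^{[i]} \in \mathbb{R}^{n^{[i]}}$ indicates the output of the $i$th hidden layer; $\sigma^{[i]}(\cdot)$ denotes the
component-wise activation function at the $i$th hidden layer; and $W^{[i]} \in \mathbb{R}^{n^{[i]} \times n^{[i-1]}}$
denotes the weight matrix associated with the $i$th hidden layer and the $(i-1)$th hidden
layer. For notational consistency, one can define the 0th hidden layer as the
input layer and the $(L+1)$th hidden layer as the output layer, and hence $n^{[0]} := n$ and
$n^{[L+1]} := 1$.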
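
As a minimal sketch of the layer-by-layer recursion in (3.38), the following Python snippet computes the output of a small network. The helper names (`logistic`, `matvec`, `forward`), the weight values and the choice of the logistic function at every layer are illustrative assumptions, not part of the text; biases are dropped, as above.

```python
import math

def logistic(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Matrix-vector product W v, with W given as a list of rows."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def forward(weights, activations, x):
    """Apply a^{[i]} = sigma^{[i]}(W^{[i]} a^{[i-1]}) layer by layer (no biases).

    weights[i] and activations[i] play the roles of W^{[i+1]} and sigma^{[i+1]};
    each activation is applied component-wise.
    """
    a = x  # a^{[0]} is the input
    for W, sigma in zip(weights, activations):
        z = matvec(W, a)
        a = [sigma(z_k) for z_k in z]
    return a

# Hypothetical 2-layer example: n = 2 inputs, n^{[1]} = 2 hidden neurons, 1 output.
W1 = [[0.5, 0.3], [-0.8, 0.2]]   # W^{[1]}, shape n^{[1]} x n
W2 = [[1.2, -0.7]]               # W^{[2]}, shape 1 x n^{[1]}
x = [1.0, 2.0]
y_hat = forward([W1, W2], [logistic, logistic], x)[0]
print(y_hat)  # a number strictly between 0 and 1
```

Passing the activation functions as a list mirrors the remark that different layers may use different activations while each layer uses a single one.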

How DNNs were proposed We explain how the DNN expressed in (3.38) was
developed. The first DNN was proposed in 1965 by a Ukrainian mathematician,
named Alexey Ivakhnenko (Ivakhnenko, 1971); see the first left picture in Fig. 3.16.
He noticed that the Perceptron architecture is too simple to represent a somewhat
complex system. So he believed that a proper architecture should incorporate many
more neurons as well as capture much higher connectivity across neurons.
Obviously the most complex structure is the one in which each neuron is connected
with all of the other neurons. But it was not clear to him whether such a
complex structure is indeed found in the biological networks of the brains of intelligent
beings.

He tried to gain some insights into this from another field: evolution in
biology. In particular, he was inspired by genetic natural selection in evolution. What
inspired him is that a complex species is a consequence of evolution through many
generations by natural genetic selection. He then made an analogy between such an
evolution process and the process of a complex system (machine) of interest.
Specifically, he came up with operations/entities in a complex system which correspond to
the species, generation and selection that appear in the evolution process. See Fig. 3.15.


[Figure 3.15 diagram labels: species; generation; another generation.]


Figure 3.15. Analogy between the evolution process (by genetic natural selection in
biology) and the process of a complex system (machine).

First of all, the species was mapped to the output ŷ of the system of interest. The
generation was interpreted as the process that occurs between two consecutive
layers in the system. So a process with two generations yields a 2-layer neural
network. Lastly, the selection was captured by the activation process in the system.
These analogies naturally led to the DNN architecture illustrated in Fig. 3.15.


Some pioneering efforts on DNNs Initially, the DNN architecture in
Fig. 3.15 was investigated in depth by only a few people in the field. One of the
reasons was that there was no theoretical basis supporting that the architecture can
represent an arbitrarily complex function of a system. The architecture was based
solely on a hypothesis. There was no proof. Even worse, experimental verification
was not simple, because the technology of the day was so immature that
training a DNN took very long, from days to weeks. Nonetheless,
there were some people who studied this architecture in depth. Here we list three
of them.

The first is obviously the inventor of the DNN architecture: Alexey Ivakhnenko.
One of his great achievements in this field was to propose a 7-hidden-layer DNN
in 1971. The second pioneer is a Japanese computer scientist, named Kunihiko
Fukushima; see the middle picture in Fig. 3.16. He developed a specially-structured
DNN intended for pattern recognition in computer vision in 1980 (Fukushima,
1980). That was actually the first convolutional neural network (CNN), which is
now known as the most famous and widely-used DNN in the computer vision
field. The third pioneer is Yann LeCun, a French computer scientist who is now very famous
in the deep learning field and also a winner of the 2018 Turing Award (often regarded
as the Nobel Prize of computer science). In 1989, he trained a CNN for the purpose
of recognizing handwritten ZIP codes on mail (LeCun et al., 1989). This
development helped vitalize the deep learning field, because the trained CNN
worked very well and so was commercialized. Nonetheless, it was not enough to
enable the deep learning revolution. One of the reasons was that training
required 3 days with the technology of the day.


[Figure 3.16 photos: Alexey Ivakhnenko, Kunihiko Fukushima, Yann LeCun.]


Figure 3.16. Three deep learning pioneers in early days.

Not appreciated much in early days There were more critical reasons as to
why the DNN architecture was not appreciated much at that time. These are
twofold. The first reason concerns performance. The DNN-based algorithms of
the time were easily outperformed by much simpler approaches. One of the simpler
approaches was Support Vector Machines (SVMs), which are a sort of variant of the
margin-based LP classifier that we learned in Part I. Remember that the margin-based
LP classifier is very simple and runs very fast. The DNN
architecture, on the other hand, was much more complex and, even worse, did not
perform better. The second reason is about model complexity. The DNN model was
so complex relative to the technology of the day that it required very long training
time, from days to weeks.


Why appreciated nowadays? But as many of you know, the DNN is greatly
appreciated nowadays. Why is that? As mentioned earlier, this is mainly due to
the recent big event by Hinton (see footnote 4) and his PhD students, who demonstrated that
DNNs can achieve human-level recognition performance. How, then, did that
happen? There are two technology breakthroughs that enabled the deep learning
revolution; see Fig. 3.17.

The first breakthrough is the advent of big data. Nowadays we are living in the big
data era. There are tons of data floating around in the cyber-world. So it is possible
to gather lots of training data. One such huge dataset gathered for the purpose of
image recognition was ImageNet (Russakovsky et al., 2015). The dataset was created
in 2009 by a computer-vision team at Stanford, led by Professor Fei-Fei Li. It turned
out that this dataset played a crucial role for Hinton’s team in demonstrating the power of
DNNs, by offering enough training samples to learn
a complex model.


[Figure 3.17 images: ImageNet; NVIDIA.]


Figure 3.17. Two technology breakthroughs that enabled the deep learning revolution.


4. Geoffrey Hinton is also a co-winner of the 2018 Turing Award with Yann LeCun and another giant in the
field, Yoshua Bengio.

The second breakthrough is the supply of very fast and not-so-expensive
Graphics Processing Units (GPUs). The major company that provided such GPUs
is NVIDIA. GPUs offered great computational power, reducing the training time of
DNNs significantly.


An optimization problem Now let us connect the DNN architecture to
optimization, the interest of this book. For illustrative purposes, let us formulate an
optimization problem for the simple two-layer DNN in Fig. 3.14. The optimization
problem based on the DNN reads:

$$\min_{w} \; \frac{1}{m} \sum_{i=1}^{m} \ell\big(y^{(i)}, \hat{y}^{(i)}\big) \tag{3.39}$$

where:

$$\hat{y}^{(i)} = \sigma^{[2]}\big(W^{[2]} a^{[1],(i)}\big), \qquad a^{[1],(i)} = \sigma^{[1]}\big(W^{[1]} x^{(i)}\big), \qquad w = \big(W^{[1]}, W^{[2]}\big).$$


Optimal loss function Suppose we use the logistic function for $\sigma^{[2]}(\cdot)$:

$$\sigma^{[2]}(z) = \sigma(z) := \frac{1}{1 + e^{-z}}.$$

Then, using exactly the same argument that we made in Section 3.2, one can show
that the optimal loss is cross entropy loss:

$$\ell^{*}(y, \hat{y}) = \ell_{\mathsf{CE}}(y, \hat{y}) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y}). \tag{3.40}$$
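
To make (3.40) concrete, here is a small Python sketch that evaluates cross entropy loss and its average over a few examples, as in the objective (3.39). The function names and the toy labels and predictions are assumptions made for illustration only.

```python
import math

def cross_entropy(y, y_hat):
    """Cross entropy loss (3.40) for a label y in {0, 1} and a prediction 0 < y_hat < 1."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def objective(labels, predictions):
    """Average cross entropy over m examples, i.e., the objective (3.39) with the loss (3.40)."""
    m = len(labels)
    return sum(cross_entropy(y, p) for y, p in zip(labels, predictions)) / m

# Hypothetical labels and two sets of predictions.
labels = [1, 0, 1]
good = [0.9, 0.1, 0.8]   # predictions close to the labels
bad = [0.4, 0.6, 0.3]    # predictions far from the labels

print(objective(labels, good), objective(labels, bad))
```

Predictions that agree with the labels give a smaller average loss, which is the behavior one expects from a loss derived from maximum likelihood.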
Is it convex? Plugging cross entropy loss into (3.39), we obtain:

$$\arg\min_{w} \; \frac{1}{m} \sum_{i=1}^{m} \Big[ -y^{(i)} \log \hat{y}^{(i)} - \big(1 - y^{(i)}\big) \log\big(1 - \hat{y}^{(i)}\big) \Big] \tag{3.41}$$

where $\hat{y}^{(i)} = \sigma\big(W^{[2]} a^{[1],(i)}\big)$ with $a^{[1],(i)} = \sigma^{[1]}\big(W^{[1]} x^{(i)}\big)$. Now a natural question arises: how do we solve the
problem? We are familiar with solving only convex optimization problems. So the
question of interest is: Is the objective function convex? The answer depends
on the choice of the activation function in the hidden layer, $\sigma^{[1]}(\cdot)$.




Look ahead There is a very well-known and powerful activation function for
$\sigma^{[1]}(\cdot)$. Unfortunately, under this choice of function, the optimization
problem (3.41) has been shown to be non-convex. But there is good news: we have
a way to handle such a non-convex problem. In the next section, we will study
the details.

3.4 Deep Learning II


Recap In the previous section, we learned about the architecture of DNNs, which
have been shown to be quite powerful in recent years. We also discussed a brief
history of DNNs: who the inventor was; what the motivation was; why they were not
appreciated until recently; and what led to the deep learning revolution. We then
formulated an optimization problem for DNNs, to begin connecting them
to the optimization topics of this book. Here is the optimization problem
intended for a 2-layer DNN:

$$\min_{w} \; \frac{1}{m} \sum_{i=1}^{m} \Big[ -y^{(i)} \log \hat{y}^{(i)} - \big(1 - y^{(i)}\big) \log\big(1 - \hat{y}^{(i)}\big) \Big] \tag{3.42}$$

where

$$\hat{y}^{(i)} = \sigma\big(W^{[2]} a^{[1],(i)}\big), \qquad a^{[1],(i)} = \sigma^{[1]}\big(W^{[1]} x^{(i)}\big).$$

Here $\sigma(\cdot)$ indicates the component-wise logistic function defined as:

$$\sigma(z) := \frac{1}{1 + e^{-z}}. \tag{3.43}$$

On the other hand, $\sigma^{[1]}(\cdot)$ denotes a possibly-different activation function used at
the hidden layer.
At the end of the last section, we claimed that for a widely-used $\sigma^{[1]}(\cdot)$, the
objective function in (3.42) is non-convex, and non-convex problems are intractable in general. We also
claimed that there is a proper way to address such a non-convex optimization problem.


Outline In this section, we are going to support these claims. Specifically, we
will do four things. First of all, we will study what the widely-used
activation function is. We will then check that the objective function is not convex.
Next we will investigate what the proper way is. Finally we will discuss how to solve
the optimization problem in detail.


Widely-used activation function Consider the operation that occurs at one
neuron in the hidden layer; see Fig. 3.18. As mentioned earlier, the DNN
architecture takes the Perceptron architecture as the basic operation unit. So the basic
operation consists of two procedures. First we perform a linear operation by
aggregating weighted signals coming from neurons in the preceding layer, thus yielding
an output, say z. The output z is then passed onto an activation function, so we
get an output, say a.

Figure 3.18. Rectified Linear Unit (ReLU).


The activation function which has been widely used in recent years is
the Rectified Linear Unit, simply called ReLU. The function is
very simple: it passes the input through if it is non-negative, and yields 0 otherwise:

$$a = \begin{cases} z, & \text{if } z \geq 0; \\ 0, & \text{if } z < 0. \end{cases} \tag{3.44}$$

So it can be represented as:

$$a = \max(0, z). \tag{3.45}$$


A brief history of ReLU In fact, the ReLU function was introduced in very early 
days in a variety of fields, not limited to the deep learning field. It appeared even in 
Fukushima’s 1980 paper on convolutional neural networks (CNNs) (Fukushima, 
1980). 

But the function was not frequently used in DNNs until recently. Instead,
more interpretable activation functions like the logistic function were widely used.
Another popular activation function was a shifted and scaled version of the logistic function,
called the tanh function:

$$\begin{aligned} \tanh(z) &= \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \\ &= \frac{1}{1 + e^{-2z}} - \frac{e^{-2z}}{1 + e^{-2z}} \\ &= \sigma(2z) - \big(1 - \sigma(2z)\big) \\ &= 2\sigma(2z) - 1. \end{aligned} \tag{3.46}$$

Note that tanh(z) = 2σ(2z) − 1, so the range of the function is stretched and shifted from (0, 1)
to (−1, 1).
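
The identity in (3.46) is easy to check numerically. The following Python sketch compares the standard library's `math.tanh` with the expression 2σ(2z) − 1 at a few arbitrarily chosen points; the helper names are assumptions for illustration.

```python
import math

def logistic(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh_via_logistic(z):
    """tanh expressed through the logistic function, as in (3.46)."""
    return 2.0 * logistic(2.0 * z) - 1.0

# Arbitrary test points; the two columns should agree up to rounding error.
for z in [-3.0, -1.0, 0.0, 0.5, 2.0]:
    print(z, math.tanh(z), tanh_via_logistic(z))
```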


A common rule of thumb that had been applied in the deep learning field until
recently was to use the logistic function only at the output layer, while taking the
tanh function at all of the other neurons, placed in hidden layers. Many empirical
results have demonstrated that this rule of thumb typically yields a better or equal
performance, as compared to the alternative which takes the logistic function
at all places. There is no theoretical justification for this. But it seems
reasonable. The reason is that taking the tanh function at hidden neurons
broadens the output range, thus yielding more degrees of freedom relative to
the logistic function.

[Figure 3.19 photos: Yoshua Bengio (left), Seppo Linnainmaa (right).]


Figure 3.19. Yoshua Bengio (left) is one of the giants in the deep learning field, and also
a co-recipient of the 2018 Turing Award. One of his main achievements is to demonstrate
the power of the ReLU activation function, thus popularizing the use of the function in the
deep learning community. Seppo Linnainmaa (right) is the inventor of backpropagation, which
serves as an efficient way of computing gradients in DNNs.


ReLU has been prevalent since 2011 A big wave recently arose in the domain
of activation functions. In 2011, one of the deep learning heroes, Yoshua
Bengio (see the left picture in Fig. 3.19), together with his group members, Xavier
Glorot (PhD student) and Antoine Bordes (postdoc), demonstrated via extensive
simulation results that ReLU enables faster and more effective training of DNNs,
compared to the logistic and/or tanh functions (Glorot et al., 2011). This was also
empirically confirmed by numerous practitioners on many datasets. Hence, ReLU
now acts as the default activation function in hidden layers.

They also provided some intuitions as to why that is the case. One intuition
is that ReLU better mimics how the brains of intelligent beings work. A report by
neuroscientists says that only a few percent of the neurons in human brains are
activated even during active brain activities. This is somewhat consistent with a
consequence of taking ReLU, which sets the outputs of many of the neurons
to 0, namely those whose inputs are negative.

Another explanation concerns a technical aspect, tailored to a
particular yet popular learning algorithm. One of the popular training algorithms
employed in the field is based on computing gradients of the objective function
w.r.t. weights (the optimization variables). The gradients of the logistic and
tanh functions are somewhat limited in their dynamics: they are close to 0 when z is very large or very small (i.e., very negative).
Why? Think about the shape of the functions. They saturate at both ends, so they have a meaningful
gradient only when z is in a narrow range. On the other hand, the gradient of ReLU
does not vanish even when z is very large. Notice that the gradient of ReLU reads:

$$\frac{d}{dz}\mathsf{ReLU}(z) = \begin{cases} 1, & \text{if } z > 0; \\ 0, & \text{if } z < 0; \\ \text{undefined}, & \text{if } z = 0. \end{cases} \tag{3.47}$$

Note that the gradient equals 1 even when z is very large. So they believed that this
non-vanishing gradient effect yields better training.
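
The saturation effect described above can be seen numerically. The sketch below evaluates the derivatives of the logistic, tanh and ReLU functions at a few points; the function names and sample points are illustrative assumptions, and the value returned by `relu_grad` at exactly z = 0 is merely a convention chosen here, since (3.47) leaves it undefined.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad(z):
    """Derivative of the logistic function: sigma(z) * (1 - sigma(z))."""
    s = logistic(z)
    return s * (1.0 - s)

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

def relu_grad(z):
    """Derivative of ReLU as in (3.47); the value at z = 0 is a convention (assumption)."""
    return 1.0 if z >= 0 else 0.0

for z in [0.5, 2.0, 5.0, 10.0]:
    print(z, logistic_grad(z), tanh_grad(z), relu_grad(z))
```

For large z the first two derivatives shrink toward 0, while the ReLU derivative stays at 1.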


Remark on ReLU As you may see from (3.47), there is an issue in computing the
gradient of ReLU: the function is not differentiable everywhere. The gradient is
undefined at z = 0. But this is not a big deal in practice. In reality, the event z = 0
rarely happens. In fact, the event that z is exactly zero essentially never occurs; it is a
measure-zero event. So there is no problem in using ReLU in practice, although it is mathematically
problematic.


Convex vs. non-convex? Let us go back to the optimization problem. When
taking the ReLU activation function, we obtain:

$$\min_{w} \; \frac{1}{m} \sum_{i=1}^{m} \Big[ -y^{(i)} \log \hat{y}^{(i)} - \big(1 - y^{(i)}\big) \log\big(1 - \hat{y}^{(i)}\big) \Big] \tag{3.48}$$

where

$$\hat{y}^{(i)} = \sigma\big(W^{[2]} \max(0, W^{[1]} x^{(i)})\big).$$

The question of interest in this book is: Is the objective function convex? As claimed
earlier, the answer is no. The objective function is non-convex in general. The proof
of this is a bit involved, as it may include a complicated Hessian calculation, and it
would distract us from the main stream of this section. So we omit
the proof here. But you will have a chance to prove this in Prob 8.5 for a simple
setting, illustrated in Fig. 3.20.


[Figure 3.20 diagram: x → a = max(0, w_1 x) → ŷ = 1/(1 + e^{−w_2 a}).]


Figure 3.20. A simple 2-layer DNN in which an associated objective function is
non-convex.

Figure 3.21. Shape of the objective function of the simple 2-layer DNN in Fig. 3.20.


If you are not interested in the proof, you may want to be convinced about
non-convexity from the numerical plot in Fig. 3.21. Notice in the figure that
the objective function in this simple setting is indeed non-convex.
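
Non-convexity in the simple setting of Fig. 3.20 can also be checked with a few lines of Python. A convex function must satisfy J((u + v)/2) ≤ (J(u) + J(v))/2 for every pair of points u and v, so a single violation is enough. The sketch below assumes one training example with x = 1 and y = 1 and picks the two points by hand; these values are illustrative assumptions, not taken from the text.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def objective(w1, w2, x=1.0, y=1.0):
    """Cross entropy of the network in Fig. 3.20 for one (assumed) example (x, y)."""
    a = max(0.0, w1 * x)          # hidden neuron with ReLU
    y_hat = logistic(w2 * a)      # output neuron with logistic activation
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

u = (-2.0, -2.0)
v = (2.0, 2.0)
mid = ((u[0] + v[0]) / 2, (u[1] + v[1]) / 2)   # the midpoint (0, 0)

lhs = objective(*mid)                           # value at the midpoint
rhs = 0.5 * (objective(*u) + objective(*v))     # average of the endpoint values
print(lhs, rhs, lhs > rhs)  # lhs > rhs means the convexity inequality is violated
```

At the midpoint and at u the hidden neuron outputs 0, so the loss equals log 2 there, whereas at v the loss is close to 0; the midpoint value therefore exceeds the average of the endpoint values.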


A way to handle such a non-convex problem As mentioned earlier, there
is a proper way to handle such a non-convex optimization problem. The way is
inspired by an observation made by numerous practitioners working in the field.
Many of their experimental results revealed that in most cases:


Any local minimum is the global minimum. (3.49)


What this means is that in many practical settings, there is no spurious
local minimum. As you may suspect, this is not a mathematically correct
statement. In fact, it has been proven to be mathematically wrong, meaning that there are
counter-examples in which there are spurious local minima. But it has also been
empirically shown that those counter-examples rarely arise in many of the working
DNNs. In fact, our understanding of this is still very limited. In
other words, we currently have no idea what the necessary/sufficient condition
for (3.49) to hold is.

Nonetheless, in many realistic scenarios, (3.49) has been observed. So many people
believe that in most cases of interest, the landscape of the objective function for a
DNN-based optimization problem looks like the one in Fig. 3.22.

Note in Fig. 3.22 that there is no spurious local minimum. We have only
the global minimum or saddle points (see footnote 5). This observation, made through many
experimental results, suggests a good guideline in practice: simply find
any minimum and then take it as a solution, no matter what the type of the
optimization problem is. The question of interest is then: How to find a minimum?
One very popular way is to apply gradient descent, which we are familiar with. Of
course, the gradient descent algorithm may get stuck at some saddle
point, which we do not want to arrive at. But the good news in practice is that it is
extremely rare to get stuck at a saddle point when there are minima. Actually it is
difficult to arrive at a saddle point even if we wish to do so. Hence, a general
rule of thumb is to simply apply gradient descent no matter what.


Figure 3.22. Landscape of the objective function in DNN optimization.


5. A saddle point is a point at which the derivative is zero, but which is neither a local maximum nor a
local minimum.


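Gradient descent What does the gradient descent algorithm look like for the
optimization problem of interest below?

$$\arg\min_{w} \; \frac{1}{m} \sum_{i=1}^{m} \Big[ -y^{(i)} \log \hat{y}^{(i)} - \big(1 - y^{(i)}\big) \log\big(1 - \hat{y}^{(i)}\big) \Big]. \tag{3.50}$$

Here $\hat{y}^{(i)} = \sigma\big(W^{[2]} \max(0, W^{[1]} x^{(i)})\big)$, and $J(w)$ denotes the objective function in (3.50).


The algorithm iterates the following procedure: at the $t$th iteration, the new
$(t+1)$th estimate of the weights is:

$$w^{(t+1)} \leftarrow w^{(t)} - \alpha^{(t)} \nabla_{w} J\big(w^{(t)}\big)$$

where $\alpha^{(t)}$ indicates the learning rate. Since $w$ is the collection $(W^{[1]}, W^{[2]})$, the
detailed procedure is:

$$W^{[2],(t+1)} \leftarrow W^{[2],(t)} - \alpha^{(t)} \nabla_{W^{[2]}} J\big(w^{(t)}\big);$$

$$W^{[1],(t+1)} \leftarrow W^{[1],(t)} - \alpha^{(t)} \nabla_{W^{[1]}} J\big(w^{(t)}\big).$$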



As you can see here, there are multiple weight update procedures (two in this
case). These multiple procedures raised a critical concern in the
early days of DNN research, because they require computationally
heavy calculations. The concern becomes especially serious when a DNN has many layers.
So some people tried to address this problem in the early days,
and came up with an efficient way of computing such a large number of gradients, called:


Backpropagation.


Backpropagation The same backpropagation method was independently
developed by several research groups, including the 1986 development by Hinton
together with his colleagues, David Rumelhart & Ronald Williams (Rumelhart
et al., 1986). But the first invention was earlier, around the time when the book
Perceptrons was published. In 1970, a Finnish mathematician (as
well as a computer scientist), named Seppo Linnainmaa (see the right picture in
Fig. 3.19), invented the method (Linnainmaa, 1970).

The idea is very natural, although it involves some complicated-looking math
equations. Perhaps this is one of the reasons that there were several independent
inventions of the same method. The idea is to:


Successively compute gradients in a backward manner


by using the chain rule for derivatives.


For illustrative purposes, let us first explain how it works for a simple single-example
setting (m = 1). We will then extend it to the general case.


Backpropagation in action: m = 1 The illustration of the method can be
streamlined with the help of a picture which visualizes the paths of signals. One
such path is the forward path; see the top row in Fig. 3.23. The input signal x
passes through the hidden layer to yield $z^{[1]}$ and then $a^{[1]}$. Similarly we get $z^{[2]}$ and
then $\hat{y} := a^{[2]}$.


[Figure 3.23 diagram. Forward path (top row): x → z^{[1]} = W^{[1]} x → a^{[1]} = max(0, z^{[1]}) → z^{[2]} = W^{[2]} a^{[1]} → ŷ = σ(z^{[2]}). Backward path (bottom row): dJ(w)/dŷ → dJ(w)/dz^{[2]} → dJ(w)/da^{[1]} → dJ(w)/dz^{[1]}, with dJ(w)/dW^{[2]} and dJ(w)/dW^{[1]} computed along the way.]


Figure 3.23. Illustration of backpropagation: m = 1.

Backpropagation starts from the output end and proceeds backward. Consider the gradient of the objective
function $J(w)$ w.r.t. the last output signal $\hat{y}$: $\frac{dJ(w)}{d\hat{y}}$. To ease illustration, we will use
the notation $\frac{d}{d\hat{y}}$ instead of $\nabla_{\hat{y}}$. The reason is that backpropagation is based on
the chain rule for derivatives, so the notation $\frac{d}{d\hat{y}}$ helps us better understand how
it works, relative to $\nabla_{\hat{y}}$. Since $J(w)$ is simply cross entropy loss for m = 1, the
gradient reads:

$$\frac{dJ(w)}{d\hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}. \tag{3.51}$$

One can easily compute this, since y is given in the problem and ŷ is available once
we compute the forward path.


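Next, we consider the gradient of $J(w)$ w.r.t. the second last signal $z^{[2]}$: $\frac{dJ(w)}{dz^{[2]}}$.
This is where the idea of the chain rule kicks in. Using the chain rule, we get:

$$\begin{aligned} \frac{dJ(w)}{dz^{[2]}} &= \frac{dJ(w)}{d\hat{y}} \frac{d\hat{y}}{dz^{[2]}} \\ &\overset{(a)}{=} \left( -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} \right) \hat{y}(1 - \hat{y}) \\ &= \hat{y} - y \end{aligned} \tag{3.52}$$

where (a) follows from (3.51) (already computed earlier) as well as the fact that the
gradient of the logistic function can simply be expressed as:

$$\begin{aligned} \frac{d}{dz}\sigma(z) &= \frac{d}{dz}\left( \frac{1}{1 + e^{-z}} \right) = \frac{e^{-z}}{(1 + e^{-z})^{2}} \\ &= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} \\ &= \sigma(z)\big(1 - \sigma(z)\big). \end{aligned} \tag{3.53}$$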


From (3.52), we can compute one of the gradients of interest: $\frac{dJ(w)}{dW^{[2]}}$. Again using
the chain rule, we get:

$$\frac{dJ(w)}{dW^{[2]}} = \frac{dJ(w)}{dz^{[2]}} \frac{dz^{[2]}}{dW^{[2]}} \overset{(a)}{=} \frac{dJ(w)}{dz^{[2]}} \, a^{[1]T} \tag{3.54}$$

where (a) comes from $z^{[2]} = W^{[2]} a^{[1]}$ (why do we take a transpose to get $a^{[1]T}$? The result must have the same dimension as $W^{[2]} \in \mathbb{R}^{1 \times n^{[1]}}$).
Notice that this can be computed from $\frac{dJ(w)}{dz^{[2]}}$ (already computed in (3.52)) and
the knowledge of $a^{[1]}$. To indicate how it can be computed, we draw red-lined flows
in Fig. 3.23.

Now you may have grasped the idea of how backpropagation works. We compute
gradients successively, from the output signal ($\hat{y}$) to the inner signals ($z^{[2]}$, $W^{[2]}$), all the
way back to ($z^{[1]}$, $W^{[1]}$). For those who did not get this yet, let us continue.

We next consider the gradient of $J(w)$ w.r.t. the third last signal $a^{[1]}$: $\frac{dJ(w)}{da^{[1]}}$. Again
using the chain rule, we obtain:

$$\frac{dJ(w)}{da^{[1]}} = \frac{dJ(w)}{dz^{[2]}} \frac{dz^{[2]}}{da^{[1]}} \overset{(a)}{=} W^{[2]T} \frac{dJ(w)}{dz^{[2]}} \tag{3.55}$$

where (a) is due to $z^{[2]} = W^{[2]} a^{[1]}$. Here you may be very confused about how
the last equality comes up. Why do we take a transpose for $W^{[2]}$? Why do we first
have $W^{[2]T}$, followed by $\frac{dJ(w)}{dz^{[2]}}$? Why not the other way around? There is a rule of
thumb for this computation. First of all, you need to check what the dimension of
the final result is. In this case, the final result is $\frac{dJ(w)}{da^{[1]}}$. Its dimension should be
exactly the same as that of $a^{[1]} \in \mathbb{R}^{n^{[1]}}$. Next think about the dimension of
$\frac{dJ(w)}{dz^{[2]}}$. It should be the same as that of $z^{[2]}$, which is 1 × 1 in our two-layer setting.
This suggests that $\frac{dJ(w)}{dz^{[2]}}$ should come after
$W^{[2]T}$. Otherwise, the dimensions do not match, and a syntax error occurs. Now why are
we taking a transpose for $W^{[2]}$? Again this is due to dimension matching. With the
transpose, we can make sure that the dimension of the end result is $n^{[1]} \times 1$, which
is what we want.

Again the key observation in (3.55) is that $\frac{dJ(w)}{da^{[1]}}$ can be computed from $\frac{dJ(w)}{dz^{[2]}}$
(which we already obtained from (3.52)) and the knowledge of $W^{[2]}$. See the
knowledge path marked with red lines in Fig. 3.23.

We can next do the same thing for $\frac{dJ(w)}{dz^{[1]}}$ and $\frac{dJ(w)}{dW^{[1]}}$. Using the chain rule, we get:


$$\frac{dJ(w)}{dz^{[1]}} = \frac{dJ(w)}{da^{[1]}} \frac{da^{[1]}}{dz^{[1]}} \overset{(a)}{=} \frac{dJ(w)}{da^{[1]}} \;.\!*\; \mathbf{1}\{z^{[1]} \geq 0\} \tag{3.56}$$

where (a) follows from (3.47): $\frac{d}{dz}\mathsf{ReLU}(z) = \mathbf{1}\{z \geq 0\}$. Actually this is not quite
correct mathematically, since the ReLU function is not differentiable at z = 0. But
since it is okay to ignore this rare event in practice, we simply assume that the
gradient is 1 at z = 0. Here the symbol .* indicates component-wise multiplication
(MATLAB notation), not the normal multiplication. You can also easily see that
it should be component-wise multiplication, since otherwise the dimensions do not
match. Next we get:

$$\frac{dJ(w)}{dW^{[1]}} = \frac{dJ(w)}{dz^{[1]}} \frac{dz^{[1]}}{dW^{[1]}} \overset{(a)}{=} \frac{dJ(w)}{dz^{[1]}} \, x^{T} \tag{3.57}$$

where (a) is due to $z^{[1]} = W^{[1]} x$.
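Here is a summary of all the important gradients that we derived in a backward
manner:

$$\frac{dJ(w)}{dz^{[2]}} = \hat{y} - y; \tag{3.58}$$

$$\frac{dJ(w)}{dW^{[2]}} = \frac{dJ(w)}{dz^{[2]}} \, a^{[1]T}; \tag{3.59}$$

$$\frac{dJ(w)}{da^{[1]}} = W^{[2]T} \frac{dJ(w)}{dz^{[2]}}; \tag{3.60}$$

$$\frac{dJ(w)}{dz^{[1]}} = \frac{dJ(w)}{da^{[1]}} \;.\!*\; \mathbf{1}\{z^{[1]} \geq 0\}; \tag{3.61}$$

$$\frac{dJ(w)}{dW^{[1]}} = \frac{dJ(w)}{dz^{[1]}} \, x^{T}. \tag{3.62}$$

To run the gradient descent algorithm, what we need are (3.59) and (3.62).
But the other gradients are still important, because they serve as bridges to compute
the gradients of interest ((3.59) and (3.62)) in the end.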
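
The backward recursion (3.58)-(3.62) for a single example can be sketched in Python and checked against a finite-difference approximation. The helper names, the weight values and the example (x, y) are assumptions made for illustration; the output-layer weights $W^{[2]}$ are stored as a flat list because there is one output neuron, and the ReLU derivative at exactly 0 is taken to be 1 by convention.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W1, W2, x):
    """Forward path: returns z^{[1]}, a^{[1]}, z^{[2]} and y_hat."""
    z1 = [sum(w * xj for w, xj in zip(row, x)) for row in W1]   # z^{[1]} = W^{[1]} x
    a1 = [max(0.0, z) for z in z1]                               # a^{[1]} = max(0, z^{[1]})
    z2 = sum(w * a for w, a in zip(W2, a1))                      # z^{[2]} = W^{[2]} a^{[1]}
    return z1, a1, z2, logistic(z2)

def loss(W1, W2, x, y):
    """J(w) for m = 1: cross entropy of the network output."""
    y_hat = forward(W1, W2, x)[3]
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def backprop(W1, W2, x, y):
    """Backward path: gradients w.r.t. W^{[1]} and W^{[2]}."""
    z1, a1, z2, y_hat = forward(W1, W2, x)
    dz2 = y_hat - y                                              # (3.58)
    dW2 = [dz2 * a for a in a1]                                  # (3.59)
    da1 = [w * dz2 for w in W2]                                  # (3.60)
    dz1 = [d * (1.0 if z >= 0 else 0.0) for d, z in zip(da1, z1)]  # (3.61)
    dW1 = [[d * xj for xj in x] for d in dz1]                    # (3.62)
    return dW1, dW2

def numeric_dW1(W1, W2, x, y, i, j, h=1e-6):
    """Central-difference estimate of dJ/dW^{[1]}_{ij}, used only as a sanity check."""
    Wp = [row[:] for row in W1]
    Wm = [row[:] for row in W1]
    Wp[i][j] += h
    Wm[i][j] -= h
    return (loss(Wp, W2, x, y) - loss(Wm, W2, x, y)) / (2 * h)

# Hypothetical weights and example: one hidden neuron is active, the other is not.
W1 = [[0.5, 0.3], [-0.8, 0.2]]
W2 = [1.2, -0.7]
x, y = [1.0, 2.0], 1
dW1, dW2 = backprop(W1, W2, x, y)
print(dW1, dW2)
print(numeric_dW1(W1, W2, x, y, 0, 0))  # should be close to dW1[0][0]
```

Each quantity in `backprop` reuses the one computed just before it, which is what makes the backward order efficient: no derivative is recomputed from scratch.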


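Backpropagation for general m What about the general m case? The idea
is exactly the same. The only distinction is that we need to incorporate all the
examples in computing gradients. It turns out that matrix notations help us derive
such generalized gradients. Let

$$\begin{aligned} Y &:= \big[ y^{(1)} \; y^{(2)} \; \cdots \; y^{(m)} \big] \in \mathbb{R}^{1 \times m}; \\ \hat{Y} &:= \big[ \hat{y}^{(1)} \; \hat{y}^{(2)} \; \cdots \; \hat{y}^{(m)} \big] \in \mathbb{R}^{1 \times m}; \\ A^{[1]} &:= \big[ a^{[1],(1)} \; a^{[1],(2)} \; \cdots \; a^{[1],(m)} \big] \in \mathbb{R}^{n^{[1]} \times m}; \\ Z^{[1]} &:= \big[ z^{[1],(1)} \; z^{[1],(2)} \; \cdots \; z^{[1],(m)} \big] \in \mathbb{R}^{n^{[1]} \times m}; \\ Z^{[2]} &:= \big[ z^{[2],(1)} \; z^{[2],(2)} \; \cdots \; z^{[2],(m)} \big] \in \mathbb{R}^{1 \times m}; \\ X &:= \big[ x^{(1)} \; x^{(2)} \; \cdots \; x^{(m)} \big] \in \mathbb{R}^{n \times m}. \end{aligned} \tag{3.63}$$

Using these matrix notations, one can readily show that the important corresponding
gradients for the general m case read:

$$\frac{dJ(w)}{dZ^{[2]}} = \hat{Y} - Y; \tag{3.64}$$

$$\frac{dJ(w)}{dW^{[2]}} = \frac{dJ(w)}{dZ^{[2]}} \, A^{[1]T}; \tag{3.65}$$

$$\frac{dJ(w)}{dA^{[1]}} = W^{[2]T} \frac{dJ(w)}{dZ^{[2]}}; \tag{3.66}$$

$$\frac{dJ(w)}{dZ^{[1]}} = \frac{dJ(w)}{dA^{[1]}} \;.\!*\; \mathbf{1}\{Z^{[1]} \geq 0\}; \tag{3.67}$$

$$\frac{dJ(w)}{dW^{[1]}} = \frac{dJ(w)}{dZ^{[1]}} \, X^{T}. \tag{3.68}$$

Note that these are exactly the same as those in the m = 1 case except one thing.
That is, we now have all capital letters. (Here the constant factor 1/m in the objective is omitted; it only rescales the gradients and can be absorbed into the learning rate.)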
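
As a sanity check of the matrix formulas (3.64)-(3.68), the sketch below implements them with plain Python lists and compares them with finite differences. The helper names and all numbers are illustrative assumptions. The loss here is the sum of the per-example cross entropy losses, without the 1/m factor, so that the formulas hold exactly as written; dividing the resulting gradients by m gives the gradients of the averaged objective.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def matmul(A, B):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def forward(W1, W2, X):
    Z1 = matmul(W1, X)                                   # n^{[1]} x m
    A1 = [[max(0.0, z) for z in row] for row in Z1]      # n^{[1]} x m
    Z2 = matmul(W2, A1)                                  # 1 x m
    Y_hat = [[logistic(z) for z in Z2[0]]]               # 1 x m
    return Z1, A1, Y_hat

def total_loss(W1, W2, X, Y):
    """Sum (not average) of the cross entropy losses over the m examples."""
    Y_hat = forward(W1, W2, X)[2]
    return sum(-y * math.log(p) - (1 - y) * math.log(1 - p)
               for y, p in zip(Y[0], Y_hat[0]))

def backprop_batch(W1, W2, X, Y):
    Z1, A1, Y_hat = forward(W1, W2, X)
    dZ2 = [[p - y for p, y in zip(Y_hat[0], Y[0])]]      # (3.64), 1 x m
    dW2 = matmul(dZ2, transpose(A1))                     # (3.65), 1 x n^{[1]}
    dA1 = matmul(transpose(W2), dZ2)                     # (3.66), n^{[1]} x m
    dZ1 = [[d * (1.0 if z >= 0 else 0.0) for d, z in zip(d_row, z_row)]
           for d_row, z_row in zip(dA1, Z1)]             # (3.67), n^{[1]} x m
    dW1 = matmul(dZ1, transpose(X))                      # (3.68), n^{[1]} x n
    return dW1, dW2

# Hypothetical data: n = 2 features, m = 3 examples stored as columns of X.
W1 = [[0.5, 0.3], [-0.8, 0.2]]
W2 = [[1.2, -0.7]]
X = [[1.0, -1.0, 0.5], [2.0, 0.5, -1.0]]
Y = [[1, 0, 1]]
dW1, dW2 = backprop_batch(W1, W2, X, Y)
print(dW1, dW2)
```

The matrix products in (3.65) and (3.68) automatically sum the per-example contributions, which is why the formulas look the same as in the m = 1 case.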


Look ahead So far we have formulated a DNN-based optimization problem
for supervised learning, and found that cross entropy serves as a key component in
the design of the optimal loss function. We also learned how to solve the problem
via a famous and efficient method, called backpropagation. What remains is the
programming implementation of the algorithm. In the next section, we will study such
implementation details in the context of a simple classifier via TensorFlow.

3.5 Deep Learning: TensorFlow Implementation


Recap We have thus far formulated a DNN-based optimization problem in the
context of supervised learning:

$$\min_{w} \; \frac{1}{m} \sum_{i=1}^{m} \ell_{\mathsf{CE}}\big(y^{(i)}, \hat{y}^{(i)}\big) \tag{3.69}$$

where $y^{(i)}$ denotes the label of the $i$th example; $\hat{y}^{(i)}$ indicates the predictor output of
the DNN (often the output of the logistic function); and $\ell_{\mathsf{CE}}(y, \hat{y}) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y})$.
We proved that cross entropy loss $\ell_{\mathsf{CE}}(\cdot, \cdot)$ is the optimal loss
function in the sense of maximizing the likelihood. We have also learned that in many
settings of interest, optimization problems for DNNs have no spurious local
minima, although the problems are highly non-convex. This motivated the use of
gradient descent for such problems. Lastly we studied an efficient way of computing
gradients: backpropagation, or simply backprop.


Outline In this section, we will study how to implement the algorithm via a
software tool in the context of a simple classifier. We will first investigate what the
simple classifier setting of our focus is. We will then study four implementation
details w.r.t. the classifier: (i) the dataset that we will use for training and testing; (ii)
a DNN model & ReLU activation; (iii) Softmax: a natural extension of the logistic
activation to multiple (more than two) classes; and (iv) Adam: an advanced version
of gradient descent that is widely used in practice (Kingma and Ba, 2014). Lastly
we will learn how to program the classifier via one prominent deep
learning framework: TensorFlow. More specifically, we will employ a higher-level
API, Keras, which is fully integrated with TensorFlow.


Handwritten digit classification The simple classifier that we will focus on 
for implementation exercise is a handwritten digit classifier wherein the task is to 
figure out a digit from a handwritten image; see Fig. 3.24. The figure illustrates an 
instance in which an image of digit 2 is correctly recognized. 


Figure 3.24. Handwritten digit classification.


Figure 3.25. MNIST dataset: The training set is $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ with m = 60,000.
An input image is of 28-by-28 pixels, each indicating an intensity from 0 (white) to 1
(black); and each label with size 1 takes one of the 10 classes from 0 to 9.


Figure 3.26. A two-layer fully-connected neural network where the input size is 28 × 28 =
784, the number of hidden neurons is 500 and the number of classes is 10. We employ
ReLU activation for the hidden layer, and softmax activation for the output layer; see
Fig. 3.27 for details.


For training a model, we employ a popular dataset, named the MNIST (Modified
National Institute of Standards and Technology) dataset. It was created by
re-mixing the examples from NIST's original dataset, hence the name. It was
prepared by one of the deep learning pioneers, Yann LeCun. It
contains m = 60,000 training images and $m_{\text{test}}$ = 10,000 testing images. Each
image, say $x^{(i)}$, consists of 28 × 28 pixels, each indicating a gray-scale level ranging
from 0 (white) to 1 (black). It also comes with a corresponding label, say $y^{(i)}$, that
takes one of the 10 classes $y^{(i)} \in \{0, 1, \ldots, 9\}$. See Fig. 3.25.


A deep neural network model As a model, we employ a simple two-layer
DNN, illustrated in Fig. 3.26. As mentioned earlier, the rule of thumb for the
choice of the activation function in a hidden layer is to use ReLU: ReLU(x) =
max(0, x). So we adopt this here.


Figure 3.27. Softmax activation for output layer. This is a natural extension of logistic
activation intended for the 2-class case.


Softmax activation for output layer So far we have considered binary classi- 
fiers and hence employed the corresponding activation function in the output layer: 
logistic function. In our digit classifier, however, there are 10 classes in total. So the 
logistic function is not directly applicable. One natural extension of the logistic 
function for a general classifier with more than two classes is to use a generalized 
version, called softmax. See Fig. 3.27 for its operation. 

Let z be the output of the last layer in a neural network prior to activation: 


$$z := [z_1, z_2, \ldots, z_c]^T \in \mathbb{R}^{c} \tag{3.70}$$

where c denotes the number of classes. The softmax function is then defined as:

$$\hat{y}_j := [\text{softmax}(z)]_j = \frac{e^{z_j}}{\sum_{k=1}^{c} e^{z_k}}, \quad j \in \{1, 2, \ldots, c\}. \tag{3.71}$$

Note that this is a natural extension of the logistic function: for c = 2,

$$\hat{y}_1 := [\text{softmax}(z)]_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}} = \sigma(z_1 - z_2) \tag{3.72}$$

where $\sigma(\cdot)$ is the logistic function. Viewing $z_1 - z_2$ as the output of a binary classifier
prior to activation, this coincides exactly with the logistic function.

Here $\hat{y}_j$ can be interpreted as the probability that the example belongs to
class j. Hence, like the binary classifier, one may want to assume:

$$\hat{y}_j = P\left(y = [0, \ldots, 0, 1, 0, \ldots, 0]^T \mid x\right), \quad j \in \{1, \ldots, c\} \tag{3.73}$$

where the single 1 is in the jth position. As you may expect, under this assumption, one
can verify that the optimal loss function (in the sense of maximizing likelihood) is again
cross entropy loss:

$$\ell^{*}(y, \hat{y}) = \ell_{\text{CE}}(y, \hat{y}) = \sum_{j=1}^{c} -y_j \log \hat{y}_j$$


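where y indicates a label of one-hot vector type. For instance, in the case of label = 2
with c = 10, y takes:

$$y = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]^T$$

where the entries correspond to the classes 0, 1, ..., 9 in order, so the single 1 sits at
the entry for class 2.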


The proof of this is almost the same as that in the binary classifier. So we will omit 
the proof. Instead you will have a chance to prove it in Prob 8.2. 

Due to the above rationales, the softmax activation has been widely used for 
many classifiers in the field. Hence, we will also use the conventional function in 
our digit classifier. 
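
The softmax definition (3.71), its reduction to the logistic function (3.72) and the cross
entropy loss with a one-hot label can be checked numerically with a short NumPy sketch. The
helper names (softmax, logistic, cross_entropy) and the sample values of z are assumptions
chosen for illustration only.

```python
import numpy as np

def softmax(z):
    """Softmax as in (3.71). Subtracting max(z) leaves the result unchanged
    (the factor cancels in the ratio) but avoids overflow in exp."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_entropy(y, y_hat):
    """Multiclass cross entropy: sum over j of -y_j * log(y_hat_j)."""
    return float(-np.sum(y * np.log(y_hat)))

# c = 2: the first softmax output equals logistic(z1 - z2), as in (3.72).
z_two = np.array([1.3, -0.4])
gap = abs(softmax(z_two)[0] - logistic(z_two[0] - z_two[1]))

# c = 10 and label = 2: build the one-hot label and evaluate the loss.
z_ten = np.linspace(-1.0, 1.0, 10)   # hypothetical pre-activation outputs
y_hat = softmax(z_ten)
y = np.zeros(10)
y[2] = 1.0
loss = cross_entropy(y, y_hat)       # only the term for class 2 survives

print(gap, y_hat.sum(), loss)
```

Because y is one-hot, the sum in the loss collapses to a single term, the negative log of the
probability assigned to the true class.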


Adam optimizer Let us discuss a specific algorithm that we will employ in our 
setting. As mentioned earlier, we will use an advanced version of gradient descent, 
called the Adam optimizer. To see how the optimizer operates, let us first recall the 
vanilla gradient descent: 


$$w^{(t+1)} \leftarrow w^{(t)} - \alpha \nabla J(w^{(t)})$$

where $w^{(t)}$ indicates the estimated weight in the tth iteration, and $\alpha$ denotes the
learning rate. Notice that the weight update relies only on the current gradient,
reflected in $\nabla J(w^{(t)})$. Hence, in case $\nabla J(w^{(t)})$ fluctuates too much over iterations,
the weight update oscillates significantly, thereby bringing about unstable training.
To address this, people often use a variant algorithm that exploits past gradients for
the purpose of stabilization: the Adam optimizer.

Here is how Adam works. The weight update takes the following formula instead:

$$w^{(t+1)} = w^{(t)} + \alpha \frac{m^{(t)}}{\sqrt{s^{(t)}} + \epsilon} \tag{3.74}$$

where $m^{(t)}$ indicates a weighted average of the current and past gradients:

$$m^{(t)} = \frac{1}{1 - \beta_1^{t}} \left( \beta_1 m^{(t-1)} - (1 - \beta_1) \nabla J(w^{(t)}) \right). \tag{3.75}$$

Here $\beta_1 \in [0, 1]$ is a hyperparameter that captures the weight of past gradients, and
hence it is called the momentum. So the notation m stands for momentum. The
factor $\frac{1}{1 - \beta_1^{t}}$ is applied in front, in an effort to stabilize training in initial
iterations (small t). Check the detailed rationale behind this in Prob 8.9.

$s^{(t)}$ is a normalization factor that makes the effect of $\nabla J(w^{(t)})$ almost constant
over t. In case $\nabla J(w^{(t)})$ is too big or too small, we may have significantly different
scalings in magnitude. Similar to $m^{(t)}$, $s^{(t)}$ is defined as a weighted average of the
current and past values:

$$s^{(t)} = \frac{1}{1 - \beta_2^{t}} \left( \beta_2 s^{(t-1)} + (1 - \beta_2) \left(\nabla J(w^{(t)})\right)^2 \right) \tag{3.76}$$

where $\beta_2 \in [0, 1]$ denotes another hyperparameter that captures the weight of past
values, and s stands for square.

Notice that the dimensions of $w^{(t)}$, $m^{(t)}$ and $s^{(t)}$ are the same. So all the
operations that appear in the above (including division in (3.74) and square in (3.76))
are component-wise. In (3.74), $\epsilon$ is a tiny value introduced to avoid division by 0 in
practice (usually $10^{-8}$).
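
To make the update concrete, here is a minimal NumPy sketch of an Adam-style loop. It
follows the form spelled out in Prob 8.9, (3.116)-(3.120), in which the running averages m
and s are kept uncorrected and the bias correction is applied to copies (m_hat, s_hat) at
each step; it also keeps the sign convention of this section, where m accumulates negative
gradients and is added to the weights. The function name adam_minimize and the quadratic
test objective are assumptions made for illustration.

```python
import numpy as np

def adam_minimize(grad, w0, alpha=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, num_iters=1000):
    """Run Adam-style updates; grad(w) must return the gradient of J at w."""
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)   # weighted average of (negative) gradients
    s = np.zeros_like(w)   # weighted average of squared gradients
    for t in range(1, num_iters + 1):
        g = grad(w)
        m = beta1 * m - (1 - beta1) * g        # momentum recursion
        s = beta2 * s + (1 - beta2) * g ** 2   # component-wise square
        m_hat = m / (1 - beta1 ** t)           # bias correction for small t
        s_hat = s / (1 - beta2 ** t)
        w = w + alpha * m_hat / (np.sqrt(s_hat) + eps)  # component-wise division
    return w

# Hypothetical objective J(w) = 0.5 * ||w||^2, whose gradient is simply w.
w_final = adam_minimize(lambda w: w, w0=[1.0, -2.0], alpha=0.01, num_iters=2000)
print(w_final)
```

With the normalization by the square root of s_hat, the early steps have magnitude close to
alpha in every component regardless of the gradient scale, which is the stabilizing effect
described above.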


TensorFlow: MNIST data loading Let us study how to do TensorFlow
programming for implementing the simple digit classifier that we have discussed so
far. First, MNIST data loading. MNIST is a very famous dataset, so it is offered
by a sub-library: tensorflow.keras.datasets. Moreover, the train and test datasets are
already provided therein as a fixed split. So we do not need to worry about how to
split them. The only script that we should write for importing MNIST is:

from tensorflow.keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train/255.
X_test = X_test/255.

Here we divide the input (X_train or X_test) by its maximum value 255 for the
purpose of normalization. This procedure is often done as a part of data
preprocessing.


TensorFlow: 2-layer DNN In order to implement the simple DNN, illustrated 
in Fig. 3.26, we rely upon two major packages: 


(i) tensorflow.keras.models;


(ii) tensorflow.keras.layers.


The models package contains several functionalities regarding a neural network
itself. One major class is Sequential, which represents a neural network as
a linear stack of layers. The layers package includes many elements
that constitute a neural network. Examples include fully-connected dense
layers and activation functions. These two allow us to readily construct the model
illustrated in Fig. 3.26.


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten

# Linear stack: flatten 28x28 -> 784, then 500 ReLU units, then 10 softmax outputs
model = Sequential()
model.add(Flatten(input_shape=(28,28)))
model.add(Dense(500, activation='relu'))
model.add(Dense(10, activation='softmax'))


Here Flatten is a layer that reshapes a higher-dimensional input, like a 2D matrix,
into a vector. In this example, a digit image of size 28-by-28 is
flattened into a vector of size 784 (= 28 × 28). add() is a method for attaching
a layer of interest to the end of the sequential model. Dense refers to a
fully-connected layer. Its input size is automatically determined by the layer
that it is attached to in the sequential model. So the only thing to specify is
the number of output neurons. In this example, 500 refers to the number of hidden
neurons. We can also set an activation function with another argument, like
activation='relu'. The output layer comes with 10 neurons (coinciding with the
number of classes) and softmax activation.
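
As a quick sanity check on the size of this model, the number of trainable parameters can be
counted by hand. The sketch below assumes the Keras default that every Dense layer carries a
bias vector in addition to its weight matrix (the derivations earlier in this chapter omit
bias terms); the helper name dense_params is hypothetical.

```python
def dense_params(n_in, n_out):
    """Parameters of a fully-connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

n_input = 28 * 28                        # Flatten has no parameters of its own
hidden = dense_params(n_input, 500)      # hidden layer with 500 neurons
output = dense_params(500, 10)           # output layer with 10 classes
total = hidden + output

print(hidden, output, total)
```

The total can be compared against the count that model.summary() reports for the model
constructed above.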


TensorFlow: Training a model For training, we need to first set up an
algorithm (optimizer) to be employed. Here we use the Adam optimizer. As mentioned
earlier, Adam has three key hyperparameters: (i) the learning rate $\alpha$; (ii) $\beta_1$
(capturing the weight of past gradients); and (iii) $\beta_2$ (indicating the weight of the
square of past gradients). The default choice reads:
$(\alpha, \beta_1, \beta_2) = (0.001, 0.9, 0.999)$. So these values are used if nothing is
specified.

We also need to specify a loss function. Here we employ the optimal loss
function: cross entropy loss. A performance metric that we will look at during training
and testing can also be specified. One metric frequently employed is accuracy. One
can set all of these via another method, compile().

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])


Here the option optimizer='adam' sets the default choice of the learning rate and
betas. For a manual choice, we first define:


import tensorflow
opt = tensorflow.keras.optimizers.Adam(
    learning_rate=0.01,
    beta_1=0.92,
    beta_2=0.992)


We then replace the above option with optimizer=opt. The option
loss='sparse_categorical_crossentropy' means the use of cross entropy loss with labels
given as integer class indices (as in y_train) rather than as one-hot vectors.
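
The difference between integer ("sparse") labels and one-hot labels affects only how the
label is stored, not the value of the loss. The following NumPy sketch illustrates this on
a tiny hand-made batch; the function names, the probabilities in P and the averaging over
examples are assumptions for illustration and are not taken from the Keras source.

```python
import numpy as np

def categorical_ce(Y_onehot, P):
    """Mean cross entropy with one-hot labels (one row per example)."""
    return float(np.mean(-np.sum(Y_onehot * np.log(P), axis=1)))

def sparse_categorical_ce(labels, P):
    """Mean cross entropy with integer labels: pick log-probability of true class."""
    rows = np.arange(len(labels))
    return float(np.mean(-np.log(P[rows, labels])))

# Hypothetical softmax outputs for 3 examples and 3 classes.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 2])
Y_onehot = np.eye(3)[labels]     # one-hot rows built from the integer labels

print(categorical_ce(Y_onehot, P), sparse_categorical_ce(labels, P))
```

Both functions return the same number, so the sparse variant simply saves the step of
converting y_train into one-hot vectors.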

Now we can train the model on MNIST data. During training, we
often employ only a part of the entire set of examples to compute a gradient of a loss
function. Such a part is called a batch. Two more terms are worth noting. One is the
step, which refers to a loss computation procedure spanning the examples only in a
single batch. The other is the epoch, which refers to the entire procedure associated
with all the examples. In our experiment, we use a batch size of 64 and 20 epochs.


model.fit(X_train, y_train, batch_size=64, epochs=20) 
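
The relation between batches, steps and epochs in this setting can be worked out with a few
lines of arithmetic. The sketch assumes that the last, partially filled batch of each epoch
is still processed as its own step rather than dropped; the variable names are illustrative.

```python
import math

m = 60000          # number of MNIST training examples
batch_size = 64
epochs = 20

steps_per_epoch = math.ceil(m / batch_size)               # batches needed to cover m
last_batch_size = m - (steps_per_epoch - 1) * batch_size  # size of the final batch
total_steps = steps_per_epoch * epochs                    # gradient updates overall

print(steps_per_epoch, last_batch_size, total_steps)
```

Since 60,000 is not a multiple of 64, the final batch of every epoch is smaller than the
others.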


TensorFlow: Testing the trained model For testing, we first need to make a
prediction from the model output. To this end, we use the predict() function as
follows:


model.predict(X_test).argmax(1) 


Here argmax(1) returns the class w.r.t. the highest softmax output among the 10 
classes. In order to evaluate the test accuracy, we use the evaluate() function: 


model.evaluate(X_test, y_test) 
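
What argmax(1) and the accuracy metric compute can be seen on a small stand-in for the
output of predict(). The array probs below is a made-up batch of softmax outputs for four
examples and three classes, and labels holds hypothetical true classes; neither comes from
MNIST.

```python
import numpy as np

# Each row is one example; each column is the softmax output for one class.
probs = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.2, 0.5, 0.3]])
labels = np.array([1, 0, 2, 2])

pred = probs.argmax(1)                       # most likely class in each row
accuracy = float(np.mean(pred == labels))    # fraction of correct predictions

print(pred, accuracy)
```

The last example is misclassified, so three of the four predictions are correct.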


Look ahead This is the end of the supervised learning part. There is more
content on supervised learning that may be of interest to you, but we stop here in order
to move on to other topics. If you are interested in more on supervised learning, we
recommend taking one of the many useful deep learning courses offered online, e.g.,
on Coursera.

In the next section, we will move on to another application of optimization:
unsupervised learning. Specifically, we will focus on one of the popular machine
learning frameworks for unsupervised learning, called Generative Adversarial Networks
(GANs for short) (Goodfellow et al., 2014). It turns out that the duality theorems that
we learned in Part II play a crucial role in understanding GANs. We will cover
details from the next section onwards.

Problem Set 8


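Prob 8.1 (Logistic regression) Suppose that $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ are independent
across all the examples. Show that

$$P\left(y^{(1)}, \ldots, y^{(m)} \mid x^{(1)}, \ldots, x^{(m)}\right) = \prod_{i=1}^{m} P\left(y^{(i)} \mid x^{(i)}\right). \tag{3.77}$$

In Section 3.2, we proved under the Perceptron architecture that logistic regression
is optimal in the sense of maximizing:

$$P\left((x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\right).$$

Prove that logistic regression is optimal also in the sense of maximizing the following
conditional probability:

$$P\left(y^{(1)}, \ldots, y^{(m)} \mid x^{(1)}, \ldots, x^{(m)}\right). \tag{3.78}$$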


Prob 8.2 (Multiclass classifier & softmax) This problem explores a general 
setting in which the number of classes is arbitrary, say c. Let 


$$z := [z_1, z_2, \ldots, z_c]^T \in \mathbb{R}^{c} \tag{3.79}$$

be the output of a neural network model prior to activation. In an attempt to make
those real values $z_j$'s interpretable as probability quantities that lie between
0 and 1, people usually employ the following activation function, called softmax:

$$\hat{y}_j := [\text{softmax}(z)]_j = \frac{e^{z_j}}{\sum_{k=1}^{c} e^{z_k}}, \quad j \in \{1, 2, \ldots, c\}. \tag{3.80}$$

Note that this is a natural extension of the logistic function: for c = 2,

$$\hat{y}_1 := [\text{softmax}(z)]_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}} = \sigma(z_1 - z_2) \tag{3.81}$$

where $\sigma(\cdot)$ is the logistic function. Viewing $z_1 - z_2$ as the output of a binary classifier
prior to activation, this coincides with logistic regression.

Let $y \in \{[1, 0, \ldots, 0]^T, [0, 1, 0, \ldots, 0]^T, \ldots, [0, \ldots, 0, 1]^T\}$ be a label of
one-hot-vector type. Here $\hat{y}_j$ can be interpreted as the probability that the example
belongs to class j. Hence, let us assume that

$$\hat{y}_j = P\left(y = [0, \ldots, 0, 1, 0, \ldots, 0]^T \mid x\right), \quad j \in \{1, \ldots, c\} \tag{3.82}$$

where the single 1 is in the jth position. We also assume that training examples
$\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ are independent over i.


(a) Derive the likelihood of training examples:

$$P\left((x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\right). \tag{3.83}$$

This should be expressed in terms of $y^{(i)}$'s and $\hat{y}^{(i)}$'s.

(b) Derive the optimal loss function that maximizes the likelihood (3.83). 

(c) What is the name of the optimal loss function derived in part (b)? What is 
the rationale behind the naming? 


Prob 8.3 (Jensen’s inequality & cross entropy) 


(a) Suppose that a function f is concave and X is a discrete random variable.
Show that

$$\mathbb{E}[f(X)] \leq f(\mathbb{E}[X]).$$

Also identify conditions under which the equality holds.

(b) In Section 3.2, we defined cross entropy only for two binary random variables.
Actually it can be defined for any two arbitrary distributions, say p and q, as:

$$H(p, q) := -\sum_{x \in \mathcal{X}} p(x) \log q(x) = \mathbb{E}_p\left[\log \frac{1}{q(X)}\right] \tag{3.84}$$

where $X \in \mathcal{X}$ is a discrete random variable. Show that

$$H(p, q) \geq H(p) := -\sum_{x \in \mathcal{X}} p(x) \log p(x) = \mathbb{E}_p\left[\log \frac{1}{p(X)}\right] \tag{3.85}$$

where H(p) is known as the Shannon entropy. Also identify conditions
under which the equality in (3.85) holds.


Prob 8.4 (Kullback-Leibler divergence) In statistics, information theory and
machine learning, there is a very well-known divergence measure, called Kullback-Leibler
divergence, defined as: for two distributions p and q,

$$\text{KLD}(p, q) := \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_p\left[\log \frac{p(X)}{q(X)}\right] \tag{3.86}$$

where $X \in \mathcal{X}$ is a discrete random variable. Show that

$$\text{KLD}(p, q) \geq 0. \tag{3.87}$$

Also identify conditions under which the equality in (3.87) holds.


Prob 8.5 (Non-convexity of a 2-layer DNN) Consider a 2-layer DNN with 
one input neuron, one hidden neuron and one output neuron. We employ ReLU 
activation for the hidden neuron while using the logistic function for the output 
neuron: 


$$\hat{y} = \frac{1}{1 + e^{-w_2 \max(0, w_1 x)}} \tag{3.88}$$

where $x \in \mathbb{R}$ indicates an input; $w_1 \in \mathbb{R}$ and $w_2 \in \mathbb{R}$ denote the weights for hidden
and output layers, respectively. Let $y \in \{0, 1\}$ be a label. Consider cross entropy loss
for an objective function:

$$J(w_1, w_2) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y}). \tag{3.89}$$

(a) Consider a case in which $(w_1, w_2) = (t, t)$ where $t \in \mathbb{R}$. Derive
$\nabla_t J(w_1, w_2)$. Compute $\nabla_t J(w_1, w_2)|_{t=0}$.

(b) Still consider the case in part (a). Derive $\nabla_t^2 J(w_1, w_2)$.

(c) Now consider a different case in which $(w_1, w_2) = (t, -t)$ where $t \in \mathbb{R}$.
Derive $\nabla_t J(w_1, w_2)$. Compute $\nabla_t J(w_1, w_2)|_{t=0}$.

(d) Consider the case in part (c). Derive $\nabla_t^2 J(w_1, w_2)$.

(e) Show that $J(w_1, w_2)$ is non-convex in $(w_1, w_2)$.


Prob 8.6 (Backpropagation: 2-layer DNN with bias terms) In Section 3.3,
we considered a DNN architecture in which the linear operation that occurs at each
neuron does not have a bias term:

$$z^{[\ell]} = W^{[\ell]} a^{[\ell-1]} \tag{3.90}$$

where $z^{[\ell]} \in \mathbb{R}^{n^{[\ell]}}$ and $a^{[\ell]} \in \mathbb{R}^{n^{[\ell]}}$ indicate the pre-activation and post-activation
outputs of the $\ell$th hidden layer, respectively; and
$W^{[\ell]} \in \mathbb{R}^{n^{[\ell]} \times n^{[\ell-1]}}$ denotes a
weight matrix between the $(\ell - 1)$th and $\ell$th hidden layers.

In this problem, we explore a slightly more general DNN architecture which
allows for having a bias term in the linear operation:

$$z^{[\ell]} = W^{[\ell]} a^{[\ell-1]} + b^{[\ell]} \tag{3.91}$$

where $b^{[\ell]} \in \mathbb{R}^{n^{[\ell]}}$. One special structural assumption on $b^{[\ell]}$ is that all the
components in $b^{[\ell]}$ are identical:

$$b^{[\ell]} = [b_\ell, b_\ell, \ldots, b_\ell]^T$$

where $b_\ell \in \mathbb{R}$.

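Consider a 2-layer DNN with such bias terms. We assume that the hidden layer
activation is ReLU and the output layer activation is logistic. Let
$\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ be training examples. Consider cross entropy loss for an
objective function:

$$J(w) = \sum_{i=1}^{m} -y^{(i)} \log \hat{y}^{(i)} - (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \tag{3.92}$$

where $w := (W^{[1]}, W^{[2]})$.


(a) Suppose m = 1. Show that

$$\frac{dJ(w)}{dz^{[2]}} = \hat{y} - y, \tag{3.93}$$
$$\frac{dJ(w)}{dW^{[2]}} = \frac{dJ(w)}{dz^{[2]}} a^{[1]T}, \tag{3.94}$$
$$\frac{dJ(w)}{db^{[2]}} = \frac{dJ(w)}{dz^{[2]}}, \tag{3.95}$$
$$\frac{dJ(w)}{da^{[1]}} = W^{[2]T} \frac{dJ(w)}{dz^{[2]}}, \tag{3.96}$$
$$\frac{dJ(w)}{dz^{[1]}} = \frac{dJ(w)}{da^{[1]}} \odot \mathbf{1}\{z^{[1]} \geq 0\}, \tag{3.97}$$
$$\frac{dJ(w)}{dW^{[1]}} = \frac{dJ(w)}{dz^{[1]}} x^{T}, \tag{3.98}$$
$$\frac{dJ(w)}{db^{[1]}} = \frac{dJ(w)}{dz^{[1]}} \tag{3.99}$$

where $\odot$ denotes component-wise multiplication.

(b) Consider the general m case. Let

$$\begin{aligned}
Y &:= [y^{(1)} \; y^{(2)} \; \cdots \; y^{(m)}] \in \mathbb{R}^{1 \times m}; \\
\hat{Y} &:= [\hat{y}^{(1)} \; \hat{y}^{(2)} \; \cdots \; \hat{y}^{(m)}] \in \mathbb{R}^{1 \times m}; \\
A^{[1]} &:= [a^{[1],(1)} \; a^{[1],(2)} \; \cdots \; a^{[1],(m)}] \in \mathbb{R}^{n^{[1]} \times m}; \\
Z^{[1]} &:= [z^{[1],(1)} \; z^{[1],(2)} \; \cdots \; z^{[1],(m)}] \in \mathbb{R}^{n^{[1]} \times m}; \\
Z^{[2]} &:= [z^{[2],(1)} \; z^{[2],(2)} \; \cdots \; z^{[2],(m)}] \in \mathbb{R}^{1 \times m}; \\
X &:= [x^{(1)} \; x^{(2)} \; \cdots \; x^{(m)}] \in \mathbb{R}^{n \times m}.
\end{aligned} \tag{3.100}$$

Show that

$$\frac{dJ(w)}{dZ^{[2]}} = \hat{Y} - Y, \tag{3.101}$$
$$\frac{dJ(w)}{dW^{[2]}} = \frac{dJ(w)}{dZ^{[2]}} A^{[1]T}, \tag{3.102}$$
$$\frac{dJ(w)}{db^{[2]}} = \sum_{i=1}^{m} \left[\frac{dJ(w)}{dZ^{[2]}}\right]_i, \tag{3.103}$$
$$\frac{dJ(w)}{dA^{[1]}} = W^{[2]T} \frac{dJ(w)}{dZ^{[2]}}, \tag{3.104}$$
$$\frac{dJ(w)}{dZ^{[1]}} = \frac{dJ(w)}{dA^{[1]}} \odot \mathbf{1}\{Z^{[1]} \geq 0\}, \tag{3.105}$$
$$\frac{dJ(w)}{dW^{[1]}} = \frac{dJ(w)}{dZ^{[1]}} X^{T}, \tag{3.106}$$
$$\frac{dJ(w)}{db^{[1]}} = \sum_{i=1}^{m} \left[\frac{dJ(w)}{dZ^{[1]}}\right]_i, \tag{3.107}$$

where $\left[\frac{dJ(w)}{dZ^{[j]}}\right]_i$ indicates the ith column component of
$\frac{dJ(w)}{dZ^{[j]}}$ for $j \in \{1, 2\}$.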


Prob 8.7 (Backpropagation: 3-layer DNN) Consider a 3-layer DNN such
that:

$$\begin{aligned}
z^{[\ell]} &= W^{[\ell]} a^{[\ell-1]} \in \mathbb{R}^{n^{[\ell]}}, \quad \ell \in \{1, 2, 3\}; \\
a^{[\ell]} &= \max(0, z^{[\ell]}) \in \mathbb{R}^{n^{[\ell]}}, \quad \ell \in \{1, 2\}; \\
\hat{y} &:= a^{[3]} = \frac{1}{1 + e^{-z^{[3]}}} \in \mathbb{R}
\end{aligned} \tag{3.108}$$

where $a^{[0]}$ is defined as the input $x \in \mathbb{R}^{n}$, and
$W^{[\ell]} \in \mathbb{R}^{n^{[\ell]} \times n^{[\ell-1]}}$ denotes a weight
matrix associated with the $(\ell - 1)$th and $\ell$th hidden layers. Let
$\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ be training examples. Consider cross entropy loss for an
objective function:

$$J(w) = \sum_{i=1}^{m} -y^{(i)} \log \hat{y}^{(i)} - (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \tag{3.109}$$

where $w := (W^{[1]}, W^{[2]}, W^{[3]})$. Describe the detailed procedure of backpropagation,
i.e., draw the forward and backward paths and list all the key gradient-related
equations in the correct order.


Prob 8.8 (Implementing the XOR function) Consider a 2-layer DNN with 
two input neurons, two hidden neurons and one output neuron. We employ ReLU 
activation for the hidden layer while taking logistic activation for the output neu- 
ron. We allow for having bias terms in all of the layers: 


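$$\hat{y}^{(i)} = \sigma\left(W^{[2]} \max(0, W^{[1]} x^{(i)} + b^{[1]}) + b^{[2]}\right) \tag{3.110}$$

where $x^{(i)} \in \{0, 1\}^2$ indicates the ith input example; $W^{[1]} \in \mathbb{R}^{2 \times 2}$ and
$W^{[2]} \in \mathbb{R}^{1 \times 2}$ denote the weight matrices for hidden and output layers, respectively; and
$b^{[1]} \in \mathbb{R}^2$ and $b^{[2]} \in \mathbb{R}$ indicate the bias terms at hidden and output layers,
respectively. Let $y^{(i)} \in \{0, 1\}$ be the ith label. Let training examples be:

$$(x^{(i)}, y^{(i)}) = \begin{cases}
((0, 0), 0), & i = 1; \\
((0, 1), 1), & i = 2; \\
((1, 0), 1), & i = 3; \\
((1, 1), 0), & i = 4.
\end{cases}$$

Consider cross entropy loss for an objective function:

$$J(w) = \sum_{i=1}^{4} -y^{(i)} \log \hat{y}^{(i)} - (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \tag{3.111}$$

where $w := (W^{[1]}, W^{[2]})$. Use the same matrix notations as in (3.100).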


(a) Draw the forward path. Implement a Python script for the following function:

[Yhat,Z2,A1,Z1] = ForwardPath(W1,b1,W2,b2,X)

where:

$$(\text{Yhat}, \text{Z2}, \text{A1}, \text{Z1}, \text{W1}, \text{b1}, \text{W2}, \text{b2}, \text{X}) = (\hat{Y}, Z^{[2]}, A^{[1]}, Z^{[1]}, W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}, X).$$

(b) Draw the backward path for backprop. Implement a Python script for the
following function:

[dZ2,dW2,db2,dA1,dZ1,dW1,db1]
= BackwardPath(Y,Yhat,W1,b1,W2,b2,A1,Z1,X)

where:

$$(\text{dZ2}, \text{dW2}, \text{db2}, \text{dA1}, \text{dZ1}, \text{dW1}, \text{db1}) = \left(\frac{dJ(w)}{dZ^{[2]}}, \frac{dJ(w)}{dW^{[2]}}, \frac{dJ(w)}{db^{[2]}}, \frac{dJ(w)}{dA^{[1]}}, \frac{dJ(w)}{dZ^{[1]}}, \frac{dJ(w)}{dW^{[1]}}, \frac{dJ(w)}{db^{[1]}}\right).$$


(c) Using the already implemented functions in parts (a) and (b), implement
a Python script for training the DNN via gradient descent.

(d) Consider the following weight initialization:

from numpy.random import randn, seed

# Seed NumPy's generator (the one randn draws from) for reproducibility
seed(0)

W1 = randn(n1,n)
W2 = randn(1,n1)
b1 = randn(n1,1)
b2 = randn(1,1)

where $(\text{n1}, \text{n}) := (n^{[1]}, n) = (2, 2)$. Set the learning rate as 0.1. Fix the
number of iterations (also called epochs) as 10000. Run the code implemented
in part (c) together with the above initialization to compute (W1,b1,W2,b2).

(e) For an input $x \in \{(0, 0), (0, 1), (1, 0), (1, 1)\}$, use (W1,b1,W2,b2)
computed in part (d) to obtain $\hat{y}$.


Prob 8.9 (Optimizers) Consider the gradient descent algorithm:

$$w^{(t+1)} = w^{(t)} - \alpha \nabla J(w^{(t)})$$

where $w^{(t)}$ indicates the weights of a model of interest at the tth iteration; $J(w^{(t)})$
denotes the cost function evaluated at $w^{(t)}$; and $\alpha$ is the learning rate. Note that
only the current gradient, reflected in $\nabla J(w^{(t)})$, affects the weight update.


(a) (Momentum optimizer) In the literature, there is a prominent variant of gradient
descent that takes into account past gradients as well. Using such past
information, one can damp an oscillating effect in the weight update that
may incur instability in training. To capture past gradients and therefore
address the oscillation problem, another quantity, often denoted by $m^{(t)}$, is
usually introduced:

$$m^{(t)} = \beta m^{(t-1)} - (1 - \beta) \nabla J(w^{(t)}) \tag{3.112}$$

where $\beta$ denotes another hyperparameter that captures the weight of the
past gradients, simply called the momentum. Here $m^{(t)}$ stands for the momentum
vector. The variant of the algorithm (often called the momentum optimizer)
takes the following update for $w^{(t+1)}$:

$$w^{(t+1)} = w^{(t)} + \alpha m^{(t)}. \tag{3.113}$$

Show that

$$w^{(t+1)} = w^{(t)} - \alpha (1 - \beta) \sum_{k=0}^{t-1} \beta^{k} \nabla J(w^{(t-k)}) + \alpha \beta^{t} m^{(0)}.$$

(b) (Bias correction) Assuming that $\nabla J(w^{(t)})$ is the same for all t and $m^{(0)} = 0$,
show that

$$w^{(t+1)} = w^{(t)} - \alpha (1 - \beta^{t}) \nabla J(w^{(t)}).$$

Note: For a large value of t, $1 - \beta^{t} \approx 1$, so it has almost the same scaling
as that in the regular gradient descent. On the other hand, for a small value
of t, $1 - \beta^{t}$ can be small, being far from 1. For instance, when $\beta = 0.9$ and
t = 2, $1 - \beta^{t} = 0.19$. This motivates people to rescale the momentum vector $m^{(t)}$
in (3.112) through division by $1 - \beta^{t}$. So in practice, we use:

$$\hat{m}^{(t)} = \frac{m^{(t)}}{1 - \beta^{t}}, \tag{3.114}$$
$$w^{(t+1)} = w^{(t)} + \alpha \hat{m}^{(t)}. \tag{3.115}$$

This technique is called bias correction.

(c) (Adam optimizer) Notice in (3.112) that a very large or very small value of
$\nabla J(w^{(t)})$ affects the weight update in quite a different scaling. In an effort
to avoid such a different scaling problem, people in practice often make
normalization in the weight update (3.115) via a normalization factor, often
denoted by $\hat{s}^{(t)}$:

$$w^{(t+1)} = w^{(t)} + \alpha \frac{\hat{m}^{(t)}}{\sqrt{\hat{s}^{(t)}} + \epsilon} \tag{3.116}$$

where the division is component-wise, and

$$\hat{m}^{(t)} = \frac{m^{(t)}}{1 - \beta_1^{t}}, \tag{3.117}$$
$$m^{(t)} = \beta_1 m^{(t-1)} - (1 - \beta_1) \nabla J(w^{(t)}), \tag{3.118}$$
$$\hat{s}^{(t)} = \frac{s^{(t)}}{1 - \beta_2^{t}}, \tag{3.119}$$
$$s^{(t)} = \beta_2 s^{(t-1)} + (1 - \beta_2) \left(\nabla J(w^{(t)})\right)^2. \tag{3.120}$$

Here $(\cdot)^2$ indicates a component-wise square; $\epsilon$ is a tiny value introduced
to avoid division by 0 in practice (usually $10^{-8}$); and s stands for square.
This optimizer (3.116) is called the Adam optimizer. Explain the rationale
behind the division by $1 - \beta_2^{t}$ in (3.119).


Prob 8.10 (TensorFlow implementation of a digit classifier) Consider a 
handwritten digit classifier that we learned in Section 3.5. In this problem, you are 
asked to build a classifier using a two-layer (one hidden-layer) neural network with 
ReLU activation in the hidden layer and softmax activation in the output layer. 


(a) (MNIST dataset loading) Using the following script (or otherwise), load the
MNIST dataset:

from tensorflow.keras.datasets import mnist
(X_train,y_train),(X_test,y_test)=mnist.load_data()
X_train = X_train/255.
X_test = X_test/255.

What are m (the number of training examples) and $m_{\text{test}}$? What are the
shapes of X_train and y_train?

(b) (Data visualization) Upon the code in part (a) being executed, report an
output for the following:

import matplotlib.pyplot as plt
num_of_images = 60
for index in range(1,num_of_images+1):
    plt.subplot(6,10, index)
    plt.axis('off')
    plt.imshow(X_train[index], cmap = 'gray_r')

(c) (Model) Using the skeleton code provided in Section 3.5, write a script for a
2-layer neural network model with 500 hidden units fed by MNIST data.

(d) (Training) Using the skeleton code in Section 3.5, write a script for training
the model generated in part (c) with cross entropy loss. Use the Adam
optimizer with:

learning rate = 0.001; $(\beta_1, \beta_2) = (0.9, 0.999)$

and set the number of epochs to 10. Also plot the training loss as a function of
epochs.

(e) (Testing) Using the skeleton code in Section 3.5, write a script for testing the
model (trained in part (d)). What is the test accuracy?


Prob 8.11 (True or False?) 


(a) For two arbitrary distributions, say p and q, consider cross entropy H (p, q). 
Then, 


H (p,q) = H(q) (3.121) 


where H (q) is the Shannon entropy w.r.t. q. 
(b) For two arbitrary distributions, say p and q, consider cross entropy:

$$H(p, q) := -\sum_{x \in \mathcal{X}} p(x) \log q(x) = \mathbb{E}_p\left[\log \frac{1}{q(X)}\right] \tag{3.122}$$

where $X \in \mathcal{X}$ is a discrete random variable. Then,

$$H(p, q) = H(p) := -\sum_{x \in \mathcal{X}} p(x) \log p(x) \tag{3.123}$$

only when q = p.

(c) Consider a binary classifier in the supervised learning setup where we are
given input-output example pairs $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$. Let
$0 \leq \hat{y}^{(i)} \leq 1$ be the classifier output for the ith example. Let w be
parameters of the classifier. Define:

$$w^{*}_{\text{CE}} := \arg\min_{w} \frac{1}{m} \sum_{i=1}^{m} \ell_{\text{CE}}\left(y^{(i)}, \hat{y}^{(i)}\right),$$
$$w^{*}_{\text{KL}} := \arg\min_{w} \frac{1}{m} \sum_{i=1}^{m} \text{KLD}\left(y^{(i)}, \hat{y}^{(i)}\right)$$

where $\ell_{\text{CE}}(\cdot, \cdot)$ denotes cross entropy loss and
$\text{KLD}(y^{(i)}, \hat{y}^{(i)})$ indicates the KL divergence between two binary random
variables with parameters $y^{(i)}$ and $\hat{y}^{(i)}$, respectively. Then,

$$w^{*}_{\text{CE}} = w^{*}_{\text{KL}}.$$

(d) Consider a 2-layer DNN with one input neuron, one hidden neuron and
one output neuron. We employ the linear activation both for the hidden
and output neurons:

$$\hat{y} = w_2 w_1 x \tag{3.124}$$

where $x \in \mathbb{R}$ indicates an input; $w_1 \in \mathbb{R}$ and $w_2 \in \mathbb{R}$ denote the weights for
hidden and output layers, respectively. Let $y \in \{-1, 1\}$ be a label. Consider
the squared error loss for an objective function:

$$J(w_1, w_2) = \|y - \hat{y}\|^2. \tag{3.125}$$

Then, the objective function is convex in $(w_1, w_2)$.

(e) One of the reasons that DNNs were not appreciated much during the AI
winter is that the DNN model was so complex in view of the technology
of the day although it offers better performances relative to simpler
approaches.

(f) Suppose we execute the following code:

import numpy as np
a = np.random.randn(4,3,3)
b = np.ones_like(a)
print(b[0].shape)
print(b.shape[0])

Then, the two prints yield the same results.

(g) Suppose that image is an MNIST image of numpy array type. Then, one
can use the following commands to plot the image:

import matplotlib.pyplot as plt
plt.imshow(image.squeeze(), cmap='gray_r')


3.6 Unsupervised Learning: Generative Modeling


Recap During the past five sections, we have studied some basic contents on
supervised learning. The goal of supervised learning is to estimate a function f(·) of
a computer system (machine) of interest from input-output samples, as illustrated
in Fig. 3.28.

In an effort to translate a function optimization problem (a natural formulation
of supervised learning) into a parameter-based optimization problem that we are
familiar with, we expressed the function with parameters (also called weights),
assuming a certain architecture of the system.

The architecture that we first assumed was the Perceptron. Taking the logistic
function together with cross entropy loss, we obtained logistic regression. We then
proved that logistic regression is optimal in the sense of maximizing the likelihood of
training data.

We next considered the Deep Neural Network (DNN) architecture for f(·),
which has been shown to be more expressive. Since there is no theoretical basis
on the choice of activation functions in the DNN context, we investigated only
a rule of thumb that is common in the field: taking ReLU at all hidden
neurons while taking the logistic function at the output layer. We have a
theoretical justification only on the choice of a loss function: cross entropy loss.
We have also learned that in many settings of interest, optimization problems
for DNNs have no spurious local minima, although the problems are highly
non-convex. This motivated the use of gradient descent for such problems. We
also studied an efficient way of computing gradients: backpropagation, or simply
backprop. Lastly, we investigated how to implement neural networks via
TensorFlow.


Figure 3.28. Supervised learning: Learning the function f(·) of a system of interest from
data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$.


Outline What is next? In fact, we face one critical challenge in supervised
learning: it is not easy to collect labeled data in many realistic
situations. In general, gathering labeled data is very expensive, as it usually requires
extensive human annotation. So people wish to do something without
such labeled data. Then, a natural question arises. What can we do only with
$\{x^{(i)}\}_{i=1}^{m}$?

This is where the concept of unsupervised learning kicks in. Unsupervised learning
is a methodology for learning something about data $\{x^{(i)}\}_{i=1}^{m}$. You may then
ask: What is something? There are a few candidates for such something to learn in
the field. Depending on target candidates, there are different unsupervised learning
methods.

In this section, we will start investigating details on these. Specifically, we are
going to cover the following four topics. First of all, we will study what such
candidates for something to learn are. We will then investigate what the corresponding
unsupervised learning methods are. Next, we will focus on arguably the most
prominent and fundamental learning method among them: generative modeling.
Finally, we will connect this to optimization, the topic of this book, by formulating
an optimization problem for generative models.


Candidates for something to learn There are three candidates for something
to learn, from simple to complex. The first candidate, which is perhaps the simplest,
is the basic structure of data. For instance, when $\{x^{(i)}\}_{i=1}^{m}$ indicates users/customers
data, such basic structures could be membership of individuals, community type,
gender type, or race type. For products-related data, it could be abnormal (defect) vs
normal information. The second candidate is the one that we learned about in Part
I, which is features: expressive (and/or compressed) components that well describe
characteristics of data. The last is the most complex yet most fundamental
information: the probability distribution of data, which allows us to create data as
we wish.


Three unsupervised learning methods Depending on which candidate we
focus on, we have three different unsupervised learning methods. The first is
clustering, which serves to identify the basic structures of data. You may have
heard of k-means, k-nearest neighbors, community-detection, or
anomaly-detection algorithms. All of these belong to this category. The second
is feature learning (also called representation learning), which allows us to
extract representative features. You may have heard of autoencoders, matrix
factorization, principal component analysis, or dictionary learning, all of
which can be categorized into this class. The last is generative modeling,
which enables us to create arbitrary examples that mimic the characteristics of
real data well. This is actually the most famous unsupervised learning method,
and it has received particularly significant attention in the field nowadays.
So in this book, we are going to focus on this method.


Figure 3.29. Richard Feynman left a quote on the relationship between
understanding and creating on a blackboard right before he died in 1988. The
quote says, “What I cannot create, I do not understand.” What this quote
suggests is that being able to create convincing examples of data is strong
evidence of having understood it.


Why is generative modeling prominent? Before explaining details on
generative modeling, let us say a few words about why generative modeling is
the most prominent in the field. We list three reasons below that we believe
are major.

The first reason is related to a famous quote by Richard Feynman; see
Fig. 3.29. Right before he died in 1988, he left an intriguing quote on his
blackboard: “What I cannot create, I do not understand.” What this quote
implies in the context of unsupervised learning is that creating convincing
examples of data is a necessary condition for complete understanding. In this
regard, a generative model plays an important role, as it enables us to create
arbitrary yet plausible examples that mimic real data.

The second reason is related to a recent breakthrough in the history of the
AI field made by a research scientist named Ian Goodfellow; see Fig. 3.30.
During his PhD, he developed a powerful generative model, which he named
“Generative Adversarial Networks (GANs)” (Goodfellow et al., 2014). GANs have
been shown to be extremely instrumental in a wide variety of applications, not
limited to the AI field. Such applications include image creation, human image
synthesis, image inpainting, coloring, super-resolution image synthesis, speech
synthesis, style transfer, and robot navigation, to name a few. Because GANs
work so well, in 2019 the state of California passed a bill that would ban the
use of GANs to make fake pornography without the consent of the people
depicted. So GANs have played a crucial role in popularizing generative
modeling.

The third reason is related to optimization, the interest of this book. GANs
borrow very interesting ideas from optimization, thus making many optimization
experts excited about generative models. In particular, the duality theorems
that we studied in Part II play a crucial role in understanding GANs as well as
many GAN variants.


Figure 3.30. Ian Goodfellow, a young figure in the modern AI field. He is best
known as the inventor of Generative Adversarial Networks (GANs), which made a
big wave in the history of the AI field.


Figure 3.31. A generative model is the one that generates fake data which
resembles real data. Here what resembling means in a mathematical language is
that it has a similar distribution.


Generative modeling Let us dive into details on generative modeling. Gener- 
ative modeling is a technique for generating fake data so that it has a similar dis- 
tribution as that of real data. See Fig. 3.31 for pictorial representation. The model 
parameters are learned via real data so that the learned model outputs fake data that 
resemble real data. Here an input signal can be either an arbitrary random signal or 
a specifically synthesized signal that forms the skeleton of fake data. The type of 
the input depends on applications of interest — this will be detailed later on. 


Remarks on generative models In fact, the problem of designing a generative
model is one of the most important problems in statistics, so it has been a
classical age-old problem in that field. This is because the major goal of the
field of statistics is to figure out (or estimate) the probability distribution
of data that arise in the real world (that we call real data), and the
generative model serves as an underlying framework in achieving that goal.
Actually the model can do even more. It provides a concrete function block
(called the generator in the field) which can create realistic fake data. There
is a very popular name in statistics for such a problem: the density estimation
problem. Here the density refers to the probability distribution.

As you may guess from the second reason mentioned above regarding why gen- 
erative modeling is prominent, this problem was not that popular in the AI field 
until very recently, precisely 2014 when the GANs were invented. 


How to formulate an optimization problem? Let us relate generative modeling
to optimization. As mentioned earlier, we can feed some input signal (that we
call fake input) which one can arbitrarily synthesize. Common ways employed in
the field to generate it are to use Gaussian or uniform distributions. Since it
is an input signal, we may wish to use a conventional “x” notation. So let us
use $x \in \mathbb{R}^k$ to denote a fake input, where $k$ indicates the
dimension of the signal.

Notice that this conflicts with the real data notation $\{x^{(i)}\}_{i=1}^m$.
To avoid the conflict, let us use a different notation, say
$\{y^{(i)}\}_{i=1}^m$, to denote real data. Please do not confuse these with
labeled data. These are not labels. In fact, the convention in the machine
learning field is to use a notation $z$ to indicate a fake input while
maintaining the real data notation as $\{x^{(i)}\}_{i=1}^m$. This may be
another way to go; perhaps this is the way that you should take when writing
papers. Anyhow let us take the first unorthodox yet reasonable option for this
book.

Let $\hat{y} \in \mathbb{R}^n$ be a fake output. Considering $m$ examples, let
$\{(x^{(i)}, \hat{y}^{(i)})\}_{i=1}^m$ be the $m$ fake input-output pairs and
let $\{y^{(i)}\}_{i=1}^m$ be $m$ real data examples. See Fig. 3.32.


Goal Let $G(\cdot)$ be a function of the generative model. Then, the goal of
the generative model can be stated as: Designing $G(\cdot)$ such that

$$\{G(x^{(i)})\}_{i=1}^m \approx \{y^{(i)}\}_{i=1}^m \quad \text{in distribution.}$$


Figure 3.32. Problem formulation for generative modeling. The fake input
$\{x^{(i)}\}_{i=1}^m$ is fed to the generative model, which outputs the fake
output $\{\hat{y}^{(i)}\}_{i=1}^m$; the model is trained with the real data
$\{y^{(i)}\}_{i=1}^m$.


Here what does “in distribution” mean? To make it clear, we need to quantify
closeness between two distributions. One natural yet prominent approach
employed in the statistics field is to take the following two steps:

1. Compute empirical distributions or estimate distributions from
   $\{y^{(i)}\}_{i=1}^m$ and $\{G(x^{(i)})\}_{i=1}^m$. Let such distributions
   be $Q_Y$ and $Q_{\hat{Y}}$ for real and fake data, respectively.

2. Next employ a well-known divergence measure in statistics which can serve to
   quantify the closeness of two distributions. Let $D(\cdot, \cdot)$ be one
   such divergence measure. Then, the similarity between $Q_Y$ and
   $Q_{\hat{Y}}$ can be quantified as $D(Q_Y, Q_{\hat{Y}})$.

Taking the above natural approach, one can concretely state the goal as:
Designing $G(\cdot)$ such that

$$D(Q_Y, Q_{\hat{Y}}) \text{ is minimized.}$$
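
The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are discrete toy values, the empirical distribution is a plain histogram, and the divergence is the total variation distance, which is chosen here purely for simplicity and is not the measure adopted later in the text. The function names are hypothetical.

```python
from collections import Counter


def empirical_distribution(samples):
    """Step 1: relative frequency of each distinct value in the samples."""
    counts = Counter(samples)
    total = len(samples)
    return {value: count / total for value, count in counts.items()}


def total_variation(p, q):
    """Step 2 (illustrative choice of D): 0.5 * sum_z |p(z) - q(z)|."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(z, 0.0) - q.get(z, 0.0)) for z in support)


# Toy "real" data and two candidate sets of "fake" data.
real = [0, 1, 1, 2, 2, 2]
fake_good = [2, 1, 2, 0, 2, 1]  # same histogram as the real data
fake_bad = [0, 0, 0, 0, 1, 3]   # clearly different histogram

q_real = empirical_distribution(real)
d_good = total_variation(q_real, empirical_distribution(fake_good))
d_bad = total_variation(q_real, empirical_distribution(fake_bad))
print(d_good, d_bad)
```

A generator whose outputs reproduce the histogram of the real data attains zero divergence, while a poor generator attains a strictly larger value; minimizing the divergence over the generator is what the goal above asks for.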


Optimization under the approach Hence, under the approach, one can
formulate an optimization problem as:

$$\min_{G(\cdot)} D(Q_Y, Q_{\hat{Y}}). \tag{3.126}$$

As you may easily notice, there are three major issues in solving the above
problem (3.126).

The first is that it is function optimization, which we are not familiar with.
Notice that the optimization is over the function $G(\cdot)$. Second, the
objective function $D(Q_Y, Q_{\hat{Y}})$ is a very complicated function of the
knob $G(\cdot)$. Note that $Q_{\hat{Y}}$ is a function of $G(\cdot)$, as
$\hat{y} = G(x)$. So the objective is a twice folded composite function of
$G(\cdot)$. The last is perhaps the most fundamental issue. It is not clear how
to choose a divergence measure $D(\cdot, \cdot)$.


Look ahead It turns out there are some ways to address the above issues.
Interestingly, one such way leads to an optimization problem for GANs. So in
the next section, we will study what that way is, and then take it to derive an
optimization problem for GANs.


3.7 Generative Adversarial Networks (GANs)


Recap In the previous section, we started investigating unsupervised learning.
The goal of unsupervised learning is to learn something about data, which we
newly denoted by $\{y^{(i)}\}_{i=1}^m$, instead of $\{x^{(i)}\}_{i=1}^m$.
Depending on target candidates for something to learn, there are a few
unsupervised learning methods. Among them, we explored one prominent method,
which is generative modeling. We formulated an optimization problem for
generative modeling:

$$\min_{G(\cdot)} D(Q_Y, Q_{\hat{Y}}) \tag{3.127}$$

where $Q_Y$ and $Q_{\hat{Y}}$ indicate the empirical distributions (or the
estimates of the true distributions) for real and fake data, respectively;
$G(\cdot)$ denotes the function of a generative model; and $D(\cdot, \cdot)$ is
a divergence measure. We then encountered three issues that arise in the
problem: (i) it is a function optimization which we are not familiar with;
(ii) the objective is a very complicated function of $G(\cdot)$; and (iii) it
is not clear how to choose $D(\cdot, \cdot)$.

At the end of the last section, we claimed that there are some ways to address
such issues, and interestingly, one such way leads to an optimization problem
for a recent powerful generative model, named Generative Adversarial Networks
(GANs).


Outline In this section, we are going to explore details on GANs. What we are
going to do is threefold. First we will investigate what that way leading to
GANs is. We will then take the way to derive an optimization problem for GANs.
Lastly we will demonstrate that GANs indeed address the issues: (i) the GAN
optimization problem is tractable; and (ii) the problem can also be expressed
in the form of (3.127).


What is the way to address the issues? Remember one challenge that we
faced in the optimization problem (3.127): $D(Q_Y, Q_{\hat{Y}})$ is a
complicated function of $G(\cdot)$. To address this, we take an indirect way to
represent $D(Q_Y, Q_{\hat{Y}})$. We first observe how $D(Q_Y, Q_{\hat{Y}})$
should behave, and then based on the observation, we will come up with an
indirect way to mimic the behaviour. It turns out the way leads us to
explicitly compute $D(Q_Y, Q_{\hat{Y}})$. Below are details.


How should $D(Q_Y, Q_{\hat{Y}})$ behave? One observation that we can make is
that if one can easily discriminate real data $y$ from fake data $\hat{y}$,
then the divergence must be large; otherwise, it should be small. This
naturally motivates us to:

    Interpret $D(Q_Y, Q_{\hat{Y}})$ as the ability to discriminate.

We introduce an entity that can play this discriminating role. The entity is
the one that Ian Goodfellow, the inventor of GANs, introduced, and he named it:

    Discriminator.

Goodfellow considered a simple binary-output discriminator which takes as an
input either real data $y$ or fake data $\hat{y}$. He then wanted to design
$D(\cdot)$ such that $D(\cdot)$ approximates well the probability that the
input $(\cdot)$ is real data:

$$D(\cdot) \approx \mathbb{P}((\cdot) = \text{real data}).$$

Noticing that

$$\mathbb{P}(y = \text{real}) = 1; \qquad \mathbb{P}(\hat{y} = \text{real}) = 0,$$

he wanted to design $D(\cdot)$ such that:

    $D(y)$ is as large as possible, close to 1;
    $D(\hat{y})$ is as small as possible, close to 0.

See Fig. 3.33.


How to quantify the ability to discriminate? Keeping the picture in Fig. 3.33
in mind, he wanted to quantify the ability to discriminate. To this end, he
first observed that if $D(\cdot)$ can easily discriminate, then we should have:

$$D(y) \uparrow; \qquad 1 - D(\hat{y}) \uparrow.$$

One naive way to capture the ability is simply adding the above two terms. But
Goodfellow did not take the naive way. Instead he took the following
logarithmic summation:

$$\log D(y) + \log(1 - D(\hat{y})). \tag{3.128}$$


Figure 3.33. Discriminator wishes to output $D(\cdot)$ such that $D(y)$ is as
large as possible while $D(\hat{y})$ is as small as possible.


Figure 3.34. A two-player game for GAN: Discriminator $D(\cdot)$ wishes to
maximize the quantified ability (3.129), while another player, generator
$G(\cdot)$, wants to minimize (3.129).


In NeurIPS 2016, Goodfellow gave a tutorial on GANs, mentioning that the
problem formulation was inspired by a paper published in AISTATS 2010 (Gutmann
and Hyvärinen, 2010). See Eq. (3) in the paper.

Making the particular choice, the ability to discriminate for $m$ examples can
be quantified as:

$$\frac{1}{m} \sum_{i=1}^{m} \log D(y^{(i)}) + \log(1 - D(\hat{y}^{(i)})). \tag{3.129}$$


A two-player game Goodfellow then introduced a two-player game in which 
player 1, discriminator D(-), wishes to maximize the quantified ability (3.129), 
while player 2, generator G(-), wants to minimize (3.129). See Fig. 3.34 for 
illustration. 
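
To get a feel for the quantity (3.129) that the two players fight over, the following minimal sketch evaluates it for two hypothetical discriminators: one that separates real from fake samples well, and one that outputs 0.5 on everything. The discriminator outputs are made-up numbers for illustration, not results from any trained model.

```python
import math


def discrimination_score(d_real, d_fake):
    """(1/m) * sum_i [ log D(y_i) + log(1 - D(yhat_i)) ], as in (3.129).

    d_real[i] is the discriminator output on the i-th real sample,
    d_fake[i] is its output on the i-th fake sample.
    """
    m = len(d_real)
    total = sum(math.log(dr) + math.log(1.0 - df)
                for dr, df in zip(d_real, d_fake))
    return total / m


# A sharp discriminator: close to 1 on real data, close to 0 on fake data.
sharp = discrimination_score([0.9, 0.95, 0.99], [0.1, 0.05, 0.01])

# A confused discriminator: cannot tell real from fake at all.
confused = discrimination_score([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

print(sharp, confused)
```

The score is never positive; it approaches 0 as the discriminator becomes perfect and equals $-2\log 2$ when the discriminator outputs 0.5 everywhere. The discriminator pushes the score up, while the generator tries to produce fake samples that pull it down.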


Optimization for GANs The two-player game motivated him to formulate the
following min max optimization problem:

$$\min_{G(\cdot)} \max_{D(\cdot)} \frac{1}{m} \sum_{i=1}^{m} \log D(y^{(i)}) + \log(1 - D(\hat{y}^{(i)})). \tag{3.130}$$

You may wonder why not max min. That may be another way to go, but Goodfellow
made the above choice. In fact, there is a reason why this way is taken. This
will be clearer soon. Notice that the optimization is over the two functions
$D(\cdot)$ and $G(\cdot)$, meaning that it is still a function optimization.
Luckily the year 2014 (when the GAN paper was published) came after the
starting point of the deep learning revolution, the year 2012. So Goodfellow
was very much aware of the power of neural networks:

    “Deep neural networks can well represent any arbitrary function.”

This motivated him to parameterize the two functions with DNNs, which in turn
led to the following optimization problem:

$$\min_{G(\cdot) \in \mathcal{N}} \max_{D(\cdot) \in \mathcal{N}} \frac{1}{m} \sum_{i=1}^{m} \log D(y^{(i)}) + \log(1 - D(\hat{y}^{(i)})) \tag{3.131}$$

where $\mathcal{N}$ denotes a set of DNN-based functions. This is exactly the
optimization problem for GANs.


Related to original optimization? Remember what we mentioned earlier. The
way leading to the GAN optimization is an indirect way of solving the original
optimization problem:

$$\min_{G(\cdot)} D(Q_Y, Q_{\hat{Y}}). \tag{3.132}$$

Then, a natural question arises. How are the two problems (3.131) and (3.132)
related? It turns out these are very much related. This is exactly where the
choice of min max (instead of max min) plays a role; the other choice cannot
establish a connection. It has been shown that, assuming that deep neural
networks can represent any arbitrary function, the GAN optimization (3.131) can
be translated into the original optimization form (3.132). We will prove this
below.


Simplification & manipulation Let us start by simplifying the GAN
optimization (3.131). Since we assume that $\mathcal{N}$ can represent any
arbitrary function, the problem (3.131) becomes unconstrained:

$$\min_{G(\cdot)} \max_{D(\cdot)} \frac{1}{m} \sum_{i=1}^{m} \log D(y^{(i)}) + \log(1 - D(\hat{y}^{(i)})). \tag{3.133}$$

Notice that the objective is a function of $D(\cdot)$, and $D(\cdot)$ appears
twice but with different arguments: one is $y^{(i)}$; the other is
$\hat{y}^{(i)}$. So in the current form (3.133), the inner (max) optimization
problem is not quite tractable to solve. In an attempt to make it tractable,
let us express it in a different manner using the following notations.

Define a random vector $Y$ which takes one of the $m$ real examples with
probability $\frac{1}{m}$ (uniform distribution):

$$Y \in \{y^{(1)}, \ldots, y^{(m)}\} =: \mathcal{Y}; \qquad Q_Y(y^{(i)}) = \frac{1}{m}, \quad i \in \{1, 2, \ldots, m\}$$

where $Q_Y$ indicates the probability distribution of $Y$. Similarly define
$\hat{Y}$ for fake examples:

$$\hat{Y} \in \{\hat{y}^{(1)}, \ldots, \hat{y}^{(m)}\} =: \hat{\mathcal{Y}}; \qquad Q_{\hat{Y}}(\hat{y}^{(i)}) = \frac{1}{m}, \quad i \in \{1, 2, \ldots, m\}$$

where $Q_{\hat{Y}}$ indicates the probability distribution of $\hat{Y}$. Using
these notations, one can rewrite the problem (3.133) as:

$$\min_{G(\cdot)} \max_{D(\cdot)} \sum_{i=1}^{m} Q_Y(y^{(i)}) \log D(y^{(i)}) + Q_{\hat{Y}}(\hat{y}^{(i)}) \log(1 - D(\hat{y}^{(i)})). \tag{3.134}$$

Still we have different arguments in the two $D(\cdot)$ functions.

To address this, let us introduce another notation. Let
$z \in \mathcal{Y} \cup \hat{\mathcal{Y}}$. Newly define $Q_Y(\cdot)$ and
$Q_{\hat{Y}}(\cdot)$ such that:

$$Q_Y(z) := 0 \quad \text{if } z \in \hat{\mathcal{Y}} \setminus \mathcal{Y}; \tag{3.135}$$
$$Q_{\hat{Y}}(z) := 0 \quad \text{if } z \in \mathcal{Y} \setminus \hat{\mathcal{Y}}. \tag{3.136}$$

Using the $z$ notation, one can then rewrite the problem (3.134) as:

$$\min_{G(\cdot)} \max_{D(\cdot)} \sum_{z \in \mathcal{Y} \cup \hat{\mathcal{Y}}} Q_Y(z) \log D(z) + Q_{\hat{Y}}(z) \log(1 - D(z)). \tag{3.137}$$

We see that the same arguments appear in the two $D(\cdot)$ functions.


Solving the inner optimization problem We are ready to solve the inner
optimization problem in (3.137). Key observations are: $\log D(z)$ is concave
in $D(\cdot)$; $\log(1 - D(z))$ is concave in $D(\cdot)$; and therefore, the
objective function is concave in $D(\cdot)$. This implies that a point at which
the derivative vanishes is a maximizer over the function space of $D(\cdot)$.
Hence, one can find the maximum by searching for the point at which the
derivative is zero. Taking the derivative with respect to $D(z)$ for each $z$
and setting it to zero, we get:

$$\text{Derivative} = \frac{Q_Y(z)}{D^*(z)} - \frac{Q_{\hat{Y}}(z)}{1 - D^*(z)} = 0.$$

Hence, we get:

$$D^*(z) = \frac{Q_Y(z)}{Q_Y(z) + Q_{\hat{Y}}(z)} \qquad \forall z \in \mathcal{Y} \cup \hat{\mathcal{Y}}. \tag{3.138}$$

Plugging this into (3.137), we obtain:

$$\min_{G(\cdot)} \sum_{z \in \mathcal{Y} \cup \hat{\mathcal{Y}}} Q_Y(z) \log \frac{Q_Y(z)}{Q_Y(z) + Q_{\hat{Y}}(z)} + Q_{\hat{Y}}(z) \log \frac{Q_{\hat{Y}}(z)}{Q_Y(z) + Q_{\hat{Y}}(z)}. \tag{3.139}$$


Jensen-Shannon divergence Let us massage the objective function in (3.139)
to express it as:

$$\min_{G(\cdot)} \underbrace{\sum_{z \in \mathcal{Y} \cup \hat{\mathcal{Y}}} Q_Y(z) \log \frac{Q_Y(z)}{\frac{Q_Y(z) + Q_{\hat{Y}}(z)}{2}} + Q_{\hat{Y}}(z) \log \frac{Q_{\hat{Y}}(z)}{\frac{Q_Y(z) + Q_{\hat{Y}}(z)}{2}}} - 2 \log 2. \tag{3.140}$$

The above underbraced term can be expressed with a well-known divergence
measure in statistics, called Jensen-Shannon divergence (see footnote 6): for
any two distributions, say $p$ and $q$,

$$\mathsf{JSD}(p, q) := \frac{1}{2} \sum_{z} p(z) \log \frac{p(z)}{\frac{p(z) + q(z)}{2}} + \frac{1}{2} \sum_{z} q(z) \log \frac{q(z)}{\frac{p(z) + q(z)}{2}}. \tag{3.141}$$

This is indeed a valid divergence measure, i.e., it is non-negative, being
equal to zero if and only if $p = q$. We will not prove this here, but you will
have a chance to prove it in Prob 9.1.


Equivalent form Using the divergence, one can then rewrite the
problem (3.140) as:

$$\min_{G(\cdot)} 2\,\mathsf{JSD}(Q_Y, Q_{\hat{Y}}) - 2 \log 2. \tag{3.142}$$

Hence, we get:

$$G^*_{\mathsf{GAN}} = \arg\min_{G(\cdot)} \mathsf{JSD}(Q_Y, Q_{\hat{Y}}). \tag{3.143}$$

We see that this indeed belongs to the original optimization form (3.132) if
one makes the choice $D(\cdot, \cdot) = \mathsf{JSD}(\cdot, \cdot)$.
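
The chain from (3.137) to (3.142) can be checked numerically on a small example. The sketch below is an illustration only: the two distributions are made-up toy values on a three-point support, and the function names are hypothetical. It evaluates the inner objective at the closed-form maximizer (3.138), compares the result with $2\,\mathsf{JSD} - 2\log 2$, and confirms that a perturbed discriminator does worse.

```python
import math


def inner_objective(q_real, q_fake, d):
    """sum_z Q_Y(z) log D(z) + Q_Yhat(z) log(1 - D(z)), as in (3.137)."""
    total = 0.0
    for z in set(q_real) | set(q_fake):
        p, q = q_real.get(z, 0.0), q_fake.get(z, 0.0)
        if p > 0.0:
            total += p * math.log(d[z])
        if q > 0.0:
            total += q * math.log(1.0 - d[z])
    return total


def optimal_discriminator(q_real, q_fake):
    """Closed-form maximizer D*(z) = Q_Y(z) / (Q_Y(z) + Q_Yhat(z)), as in (3.138)."""
    support = set(q_real) | set(q_fake)
    return {z: q_real.get(z, 0.0) / (q_real.get(z, 0.0) + q_fake.get(z, 0.0))
            for z in support}


def jsd(p, q):
    """Jensen-Shannon divergence with natural logarithm, as in (3.141)."""
    total = 0.0
    for z in set(p) | set(q):
        pz, qz = p.get(z, 0.0), q.get(z, 0.0)
        mz = 0.5 * (pz + qz)
        if pz > 0.0:
            total += 0.5 * pz * math.log(pz / mz)
        if qz > 0.0:
            total += 0.5 * qz * math.log(qz / mz)
    return total


# Toy distributions on a shared three-point support (illustrative values).
q_real = {"a": 0.5, "b": 0.3, "c": 0.2}
q_fake = {"a": 0.2, "b": 0.3, "c": 0.5}

d_star = optimal_discriminator(q_real, q_fake)
best = inner_objective(q_real, q_fake, d_star)

# Any other discriminator should give a smaller inner objective.
d_perturbed = {z: v + 0.1 for z, v in d_star.items()}
perturbed = inner_objective(q_real, q_fake, d_perturbed)

print(best, 2.0 * jsd(q_real, q_fake) - 2.0 * math.log(2.0), perturbed)
```

The first two printed numbers agree up to floating-point error, and the third is smaller, which is consistent with the concavity argument used to solve the inner problem.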


Look ahead So far we have formulated an optimization problem for GANs and 
made an interesting connection to the Jensen-Shannon divergence. In the next sec- 
tion, we will study how to solve the GAN optimization (3.131) and implement it 
via TensorFlow. 


6. One may guess that this is the divergence that Johan Jensen (the inventor of Jensen’s inequality) and Claude
Shannon (the father of information theory) developed. But it is not the case. Johan Jensen died in 1925
when Claude Shannon was a child, so there was no collaboration between the two. Actually it was introduced
much later, in 1991, by a Taiwanese information theorist named Jianhua Lin (Lin, 1991).

3.8 GANs: TensorFlow Implementation


Recap In the prior section, we investigated Goodfellow’s approach to deal with
an optimization problem for generative modeling, which in turn led to GANs. He
started with quantifying the ability to discriminate real against fake samples:

$$\frac{1}{m} \sum_{i=1}^{m} \log D(y^{(i)}) + \log(1 - D(\hat{y}^{(i)})) \tag{3.144}$$

where $y^{(i)}$ and $\hat{y}^{(i)} := G(x^{(i)})$ indicate real and fake
samples, respectively; $D(\cdot)$ denotes the output of the discriminator; and
$m$ is the number of examples. He then introduced two players: (i) player 1,
the discriminator, who wishes to maximize the ability; (ii) player 2, the
generator, who wants to minimize it. This naturally led to the optimization
problem for GANs:

$$\min_{G(\cdot) \in \mathcal{N}} \max_{D(\cdot) \in \mathcal{N}} \frac{1}{m} \sum_{i=1}^{m} \log D(y^{(i)}) + \log(1 - D(G(x^{(i)}))) \tag{3.145}$$

where $\mathcal{N}$ denotes a class of neural network functions. Lastly we
demonstrated that the problem (3.145) can be stated in terms of the
Jensen-Shannon divergence, thus making a connection to statistics.

Two natural questions arise. First, how do we solve the problem (3.145)?
Second, how do we program it via a popular deep learning framework, TensorFlow?


Outline In this section, we will answer these two questions. Specifically we
are going to cover four topics. First we will investigate a practical method to
solve the problem (3.145). We will then do one case study for the purpose of
exercising the method. In particular, we will study how to implement an
MNIST-style handwritten digit image generator. Next we will explore one
important implementation detail: Batch Normalization (Ioffe and Szegedy, 2015),
particularly useful for very deep neural networks. Lastly we will learn how to
write a TensorFlow script for software implementation.


Parameterization Solving the problem (3.145) starts with parameterizing the
two functions $G(\cdot)$ and $D(\cdot)$ with neural networks:

$$\min_{w} \max_{\theta} \underbrace{\frac{1}{m} \sum_{i=1}^{m} \log D_\theta(y^{(i)}) + \log(1 - D_\theta(G_w(x^{(i)})))}_{=: J(w, \theta)} \tag{3.146}$$

where $w$ and $\theta$ indicate parameters for $G(\cdot)$ and $D(\cdot)$,
respectively. Now the question of interest is: Is the parameterized problem
(3.146) the one that we are familiar with? In other words, is $J(w, \theta)$
convex in $w$? Is $J(w, \theta)$ concave in $\theta$? Unfortunately, it is not
the case. In general, the objective is highly non-convex in $w$ and also highly
non-concave in $\theta$.

Then, what can we do? Actually there is nothing more we can do beyond what we
know. We only know how to find a stationary point via a method like gradient
descent. So one practical way that we can take is to simply look for a
stationary point, say $(w^*, \theta^*)$, such that

$$\nabla_w J(w^*, \theta^*) = 0, \qquad \nabla_\theta J(w^*, \theta^*) = 0,$$

while keeping our fingers crossed that such a point yields near-optimal
performance. It turns out that, luckily, this is often the case in reality,
especially when employing neural networks for parameterization. There have been
huge efforts by many theorists to figure out why that is the case, e.g.,
(Arora et al., 2017). However, a clear theoretical understanding is still
missing despite their serious efforts.


Alternating gradient descent One practical method to attempt to find (yet
not necessarily guarantee to find) such a stationary point in the context of
the min-max optimization problem (3.146) is: alternating gradient descent.
Actually we saw this in Section 2.2. For those who do not remember, let us
explain again how it works, this time in the context of GANs.

At the $t$th iteration, we first update the generator’s weight:

$$w^{(t+1)} \leftarrow w^{(t)} - \alpha_1 \nabla_w J(w^{(t)}, \theta^{(t)})$$

where $w^{(t)}$ and $\theta^{(t)}$ denote the weights of the generator and the
discriminator at the $t$th iteration, respectively; and $\alpha_1$ is the
learning rate for the generator. Given $(w^{(t+1)}, \theta^{(t)})$, we next
update the discriminator’s weight as per:

$$\theta^{(t+1)} \leftarrow \theta^{(t)} + \alpha_2 \nabla_\theta J(w^{(t+1)}, \theta^{(t)})$$

where $\alpha_2$ is the learning rate for the discriminator. Here one important
thing to notice is that we should perform gradient ascent, i.e., the direction
along which the discriminator’s weight is updated should be aligned with the
gradient, as reflected in the plus sign. Lastly we repeat the above two steps
until convergence.

In practice, we may wish to control the frequency of the discriminator weight
update relative to that of the generator. To this end, we often employ
$k : 1$ alternating gradient descent:

1. Update generator’s weight:

   $$w^{(t+1)} \leftarrow w^{(t)} - \alpha_1 \nabla_w J(w^{(t)}, \theta^{(t \cdot k)}).$$

2. Update discriminator’s weight $k$ times while fixing $w^{(t+1)}$: for $i = 1 : k$,

   $$\theta^{(t \cdot k + i)} \leftarrow \theta^{(t \cdot k + i - 1)} + \alpha_2 \nabla_\theta J(w^{(t+1)}, \theta^{(t \cdot k + i - 1)}).$$

3. Repeat the above.
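
The k : 1 procedure above can be sketched on a toy min-max problem. The objective $J(w, \theta) = w^2 - \theta^2 + 2w\theta$ used below is an assumption made purely for illustration (it is convex in $w$, concave in $\theta$, and has its saddle point at the origin); it is not the GAN objective, and the step sizes and iteration count are arbitrary choices.

```python
def grad_w(w, theta):
    """Partial derivative of J(w, theta) = w^2 - theta^2 + 2*w*theta w.r.t. w."""
    return 2.0 * w + 2.0 * theta


def grad_theta(w, theta):
    """Partial derivative of J(w, theta) w.r.t. theta."""
    return 2.0 * w - 2.0 * theta


def alternating_gradient_descent(w, theta, alpha1=0.1, alpha2=0.1,
                                 k=2, num_iters=200):
    """k:1 alternating updates: one descent step on w, then k ascent steps on theta."""
    for _ in range(num_iters):
        # Step 1: the minimizing player (generator role) takes a descent step.
        w = w - alpha1 * grad_w(w, theta)
        # Step 2: the maximizing player (discriminator role) takes k ascent
        # steps while w is held fixed; note the plus sign.
        for _ in range(k):
            theta = theta + alpha2 * grad_theta(w, theta)
    return w, theta


w_star, theta_star = alternating_gradient_descent(w=1.0, theta=-1.0)
print(w_star, theta_star)
```

On this well-behaved toy objective the iterates spiral into the stationary point $(0, 0)$, where both gradients vanish. For the actual non-convex, non-concave GAN objective no such convergence is guaranteed, as noted above.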


You may wonder why we update discriminator more frequently than generator. 
Usually more updates in the inner optimization yield better performances in prac- 
tice. Further, we often employ the Adam counterpart of the algorithm together 
with batches. Details are omitted although we will apply such a practical version 
for programming assignment in Prob 9.4. 


A practical tip on generator Before moving on to a case study for
implementation, let us say a few words about generator optimization. Given the
discriminator’s parameter $\theta$, the generator wishes to minimize:

$$\min_{w} \frac{1}{m_B} \sum_{i \in B} \log D_\theta(y^{(i)}) + \log(1 - D_\theta(G_w(x^{(i)})))$$

where $B$ indicates a batch of interest and $m_B$ is the batch size (the number
of examples in the batch of interest). Notice that $\log D_\theta(y^{(i)})$ in
the above does not depend on the generator’s weight $w$. Hence, it suffices to
minimize the following:

$$\min_{w} \underbrace{\frac{1}{m_B} \sum_{i \in B} \log(1 - D_\theta(G_w(x^{(i)})))}_{\text{generator loss}}$$

where the underbraced term is called “generator loss”. However, in practice,
instead of minimizing the generator loss directly, people often rely on the
following proxy:

$$\min_{w} \frac{1}{m_B} \sum_{i \in B} -\log D_\theta(G_w(x^{(i)})). \tag{3.147}$$


You may wonder why. There is a technically detailed rationale behind the use of 
the proxy for the generator loss. Check this in Prob 9.2. 
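
As a numerical hint toward that rationale, the sketch below compares the gradients of the two losses for a single fake sample. It rests on assumptions that are not taken from the text: the discriminator output is modeled as $D = \sigma(a)$ for a scalar logit $a$, and the commonly cited "saturation" argument is used as the explanation; the full treatment is left to the problem mentioned above.

```python
import math


def sigmoid(a):
    """Logistic function, standing in for the discriminator's output layer."""
    return 1.0 / (1.0 + math.exp(-a))


def grad_generator_loss(a):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a)."""
    return -sigmoid(a)


def grad_proxy_loss(a):
    """d/da [-log sigmoid(a)] = -(1 - sigmoid(a))."""
    return -(1.0 - sigmoid(a))


# Early in training the discriminator easily rejects fake samples,
# i.e. D(G(x)) is close to 0, which corresponds to a very negative logit.
a = -6.0
print("D(G(x))                 =", sigmoid(a))
print("gradient, generator loss =", grad_generator_loss(a))
print("gradient, proxy loss     =", grad_proxy_loss(a))
```

When $D(G(x))$ is close to 0, the gradient of the original generator loss with respect to the logit is nearly zero, so the generator receives almost no learning signal, whereas the proxy keeps a gradient of magnitude close to 1. Both losses decrease as $D(G(x))$ increases, so they push the generator in the same direction.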


Task Let us discuss one case study for implementation. The task that we will
focus on is related to the simple digit classifier that we exercised on in
Section 3.5. The task is to generate MNIST-style handwritten digit images, as
illustrated in Fig. 3.35. Here we intend to train the generator with the MNIST
dataset so that it outputs an MNIST-style fake image when fed a random input
signal.


Figure 3.35. Generator for MNIST-style handwritten digit images.


Figure 3.36. Generator: A 5-layer fully-connected neural network where the
input size (the dimension of a latent signal) is 100; the numbers of hidden
neurons are 128, 256, 512, 1024; and the output size is 784 (= 28 x 28). We
employ ReLU activation for every hidden layer, and logistic activation for the
output layer to ensure 0-to-1 output signals. We also use Batch Normalization
prior to ReLU at each hidden layer. See Fig. 3.37 for details.


Model for generator As a generator model, we employ a 5-layer fully-connected
neural network with four hidden layers, as depicted in Fig. 3.36. For
activation at each hidden layer, we employ ReLU. Remember that an MNIST image
consists of 28-by-28 pixels, each indicating a gray-scale value that spans from
0 to 1. Hence, for the output layer, we use 784 (= 28 x 28) neurons and
logistic activation to ensure the range of [0, 1].

The employed network has five layers, so it is deeper than the 2-layer case
that we used earlier. In practice, for a somewhat deep neural network, each
layer’s signals can exhibit quite different scalings. It turns out such widely
fluctuating scaling has a detrimental effect upon training: it makes training
unstable. So in practice, people often apply an additional procedure (prior to
ReLU) so as to control the scaling in our own manner. The procedure is called:
Batch Normalization.


Figure 3.37. Batch Normalization (BN) applied to a hidden layer: First we do
zero-centering and normalization with the mean $\mu_B$ and the variance
$\sigma_B^2$ computed over the examples in an associated batch $B$. Next we do
a customized scaling by introducing two new parameters that are also learned
during training: $\gamma \in \mathbb{R}^n$ and $\beta \in \mathbb{R}^n$.


Batch Normalization (Ioffe and Szegedy, 2015) Here is how it works; see
Fig. 3.37. For illustrative purposes, focus on one particular hidden layer. Let
$z := [z_1, \ldots, z_n]^T$ be the output of the considered hidden layer prior
to activation. Here $n$ denotes the number of neurons in the hidden layer.

Batch Normalization (BN for short) consists of two steps. First we do
zero-centering and normalization using the mean and variance w.r.t. examples in
an associated batch $B$:

$$\mu_B = \frac{1}{m_B} \sum_{i \in B} z^{(i)}, \qquad \sigma_B^2 = \frac{1}{m_B} \sum_{i \in B} (z^{(i)} - \mu_B)^2 \tag{3.148}$$

where $(\cdot)^2$ indicates a component-wise square, and hence
$\sigma_B^2 \in \mathbb{R}^n$. In other words, we generate the normalized
output, say $z_{\text{norm}}^{(i)}$, as:

$$z_{\text{norm}}^{(i)} = \frac{z^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \tag{3.149}$$

where division and multiplication are all component-wise. Here $\epsilon$ is a
tiny value introduced to avoid division by 0 (typically $10^{-5}$).

Second, we do a customized scaling as per:

$$\tilde{z}^{(i)} = \gamma z_{\text{norm}}^{(i)} + \beta \tag{3.150}$$

where $\gamma, \beta \in \mathbb{R}^n$ indicate two new scaling parameters
which are learnable via training. Again the operations in (3.150) are all
component-wise.


Figure 3.38. Discriminator: A 3-layer fully-connected neural network where the
input size (the dimension of a flattened vector of a real (or fake) image) is
784 (= 28 x 28); the numbers of hidden neurons are 512, 256; and the output
size is 1. We employ ReLU activation for every hidden layer, and logistic
activation for the output layer.


BN lets the model learn the optimal scale and mean of the inputs for 
each hidden layer. This technique is quite instrumental in stabilizing and speed- 
ing up training especially for a very deep neural network. This has been verified 
experimentally by many practitioners, although no clear theoretical justification 
has been provided thus far. 
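
The two BN steps, (3.148)-(3.149) followed by (3.150), can be written out directly. The following is a minimal pure-Python sketch of the training-time forward pass on a toy batch; the function name and the toy numbers are illustrative assumptions, and a real implementation (such as the built-in Keras layer) additionally tracks running statistics for use at inference time.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Apply batch normalization to a batch of n-dimensional vectors.

    batch: list of m_B vectors (lists of length n), the pre-activation outputs.
    gamma, beta: learnable scale and shift, each a list of length n.
    """
    m_b = len(batch)
    n = len(batch[0])
    # Step 1a: component-wise mean and variance over the batch, as in (3.148).
    mu = [sum(z[j] for z in batch) / m_b for j in range(n)]
    var = [sum((z[j] - mu[j]) ** 2 for z in batch) / m_b for j in range(n)]
    out = []
    for z in batch:
        # Step 1b: zero-centering and normalization, as in (3.149).
        z_norm = [(z[j] - mu[j]) / math.sqrt(var[j] + eps) for j in range(n)]
        # Step 2: customized scaling with gamma and beta, as in (3.150).
        out.append([gamma[j] * z_norm[j] + beta[j] for j in range(n)])
    return out


# Two neurons whose raw outputs live on very different scales.
batch = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
for row in normalized:
    print(row)
```

With $\gamma = 1$ and $\beta = 0$, both components end up zero-mean with nearly identical spread even though the raw values differ by a factor of 100, which is the scale control that BN provides before the activation.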


Model for discriminator As a discriminator model, we use a simple 3-layer
fully-connected network with two hidden layers; see Fig. 3.38. Here the input
size must be the same as that of the flattened real (or fake) image. Again we
employ ReLU at hidden layers and logistic activation at the output layer.


TensorFlow: How to use BN? Let us talk about how to do TensorFlow
programming for implementation. Loading MNIST data is exactly the same as
before. So we omit it. Instead let us discuss how to use BN.

As you may expect, TensorFlow provides a built-in class for BN:
BatchNormalization(). This is placed in tensorflow.keras.layers. Here is how to
use the class in our setting:


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import ReLU

generator = Sequential()
generator.add(Dense(128, input_dim=latent_dim))
generator.add(BatchNormalization())  # BN is applied prior to ReLU
generator.add(ReLU())
generator.add(Dense(256))
generator.add(BatchNormalization())
# ...


where latent_dim is the dimension of the latent signal (which we set as 100).


TensorFlow: Models for generator & discriminator Using the DNN archi- 
tectures for generator and discriminator illustrated in Figs. 3.36 and 3.38, we can 
implement a code as below. 


from tensorflow.keras.models import Sequential 

from tensorflow.keras.layers import Dense 

from tensorflow.keras.layers import BatchNormalization 
from tensorflow.keras.layers import ReLU 


latent_dim =100 

generator=Sequential( 
generator.add(Dense(128,input_dim=latent_dim)) 
generator.add(BatchNormalizationQ) 
generator.add(ReLUQ) 
generator.add(Dense(256)) 
generator.add(BatchNormalizationQ) 
generator.add(ReLUQ) 
generator.add(Dense(512)) 
generator.add(BatchNormalizationQ) 
generator.add(ReLUQ) 
generator.add(Dense(1024)) 
generator.add(BatchNormalizationQ) 
generator.add(ReLUQ) 
generator.add(Dense(28*28,activation=’sigmoid’)) 


from tensorflow.keras.models import Sequential 
from tensorflow.keras.layers import Dense 
from tensorflow.keras.layers import ReLU 


discriminator = Sequential()
discriminator.add(Dense(512, input_shape=(784,)))
discriminator.add(ReLU())
discriminator.add(Dense(256))
discriminator.add(ReLU())
discriminator.add(Dense(1, activation='sigmoid'))


TensorFlow: Optimizers for generator & discriminator We use Adam optimizers
with lr=0.0002 and (b1,b2)=(0.5,0.999). Since we have two models (generator
and discriminator), we employ two optimizers accordingly:

from tensorflow.keras.optimizers import Adam

lr = 0.0002
b1 = 0.5
b2 = 0.999  # default choice

optimizer_G = Adam(learning_rate=lr, beta_1=b1, beta_2=b2)
optimizer_D = Adam(learning_rate=lr, beta_1=b1, beta_2=b2)


TensorFlow: Generator input As a generator input, we use a random signal 
with the Gaussian distribution. In particular, we use: 


$$x \in \mathbb{R}^{\text{latent\_dim}} \sim \mathcal{N}(0, I_{\text{latent\_dim}}).$$


Here is how to generate the Gaussian random signal in TensorFlow: 


from tensorflow.random import normal 
x = normal([batch_size,latent_dim]) 


TensorFlow: Binary cross entropy loss Consider the batch version of the 
GAN optimization (3.146): 


$$\min_{w} \max_{\theta} \frac{1}{m_B} \sum_{i \in B} \log D_\theta(y^{(i)}) + \log\left(1 - D_\theta(G_w(x^{(i)}))\right). \tag{3.151}$$


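Now introduce the ground-truth real-vs-fake indicator vector $[1,0]^T$ (real=1,
fake=0). Then, the term $\log D_\theta(y^{(i)})$ can be viewed as the minus binary cross
entropy between the real/fake indicator vector and its prediction counterpart
$[D_\theta(y^{(i)}), 1 - D_\theta(y^{(i)})]^T$:

$$\begin{aligned} \log D_\theta(y^{(i)}) &= 1 \cdot \log D_\theta(y^{(i)}) + 0 \cdot \log\left(1 - D_\theta(y^{(i)})\right) \\ &= -\ell_{\text{BCE}}\left(1, D_\theta(y^{(i)})\right). \end{aligned} \tag{3.152}$$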


On the other hand, another term $\log(1 - D_\theta(\tilde{y}^{(i)}))$ can be interpreted as the minus
binary cross entropy between the fake-vs-real indicator vector (fake=0, real=1) and
its prediction counterpart:

$$\begin{aligned} \log\left(1 - D_\theta(\tilde{y}^{(i)})\right) &= 0 \cdot \log D_\theta(\tilde{y}^{(i)}) + 1 \cdot \log\left(1 - D_\theta(\tilde{y}^{(i)})\right) \\ &= -\ell_{\text{BCE}}\left(0, D_\theta(\tilde{y}^{(i)})\right). \end{aligned} \tag{3.153}$$

From this, we see that cross entropy plays a role in computation of
the objective function. TensorFlow offers a built-in class for cross entropy:
BinaryCrossentropy(). This is placed in tensorflow.keras.losses. Here is how to
use it in our setting:
from tensorflow.keras.losses import BinaryCrossentropy 


CE_loss = BinaryCrossentropy(from_logits=False) 
loss = CE_loss(real_fake_indicator, output) 


where output denotes an output of discriminator, and real_fake_indicator is
the real/fake indicator vector (real=1, fake=0). One important thing to notice is
that output is the result after logistic activation (hence from_logits=False), and
real_fake_indicator is a vector with the same dimension as output. By default,
BinaryCrossentropy() averages the loss over the examples in the associated batch,
thus yielding a normalized version (through division by $m_B$).
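The identities (3.152) and (3.153) can be checked numerically with a small hand-written cross entropy. The sketch below is a plain-Python stand-in for the batch-averaged binary cross entropy, not TensorFlow's implementation (which also guards against log(0)); the discriminator outputs d_real and d_fake are made-up numbers.

```python
import math

def binary_cross_entropy(labels, outputs):
    """Batch average of -[t*log(p) + (1-t)*log(1-p)]."""
    total = 0.0
    for t, p in zip(labels, outputs):
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(labels)

d_real = [0.9, 0.8]   # hypothetical discriminator outputs on real images
d_fake = [0.3, 0.1]   # hypothetical discriminator outputs on fake images

# Label 1 picks out -log D(y); label 0 picks out -log(1 - D(y_tilde)).
real_term = binary_cross_entropy([1, 1], d_real)
fake_term = binary_cross_entropy([0, 0], d_fake)
```

Here real_term equals the batch average of -log D(y), and fake_term equals the batch average of -log(1 - D(ỹ)), which are exactly the two pieces of the objective in (3.151) with their signs flipped.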


TensorFlow: Generator loss Recall the proxy (3.147) for the generator loss 
that we will use: 


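$$\begin{aligned} &\min_{w} \frac{1}{m_B} \sum_{i \in B} -\log D_\theta(G_w(x^{(i)})) \\ &\overset{(a)}{=} \min_{w} \frac{1}{m_B} \sum_{i \in B} \ell_{\text{BCE}}\left(1, D_\theta(G_w(x^{(i)}))\right) \end{aligned} \tag{3.154}$$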


where (a) follows from (3.152). Hence, we can use the function CE_loss imple- 
mented above to easily write a code as below: 


g_loss = CE_loss(valid, discriminator(gen_imgs)) 


where gen_imgs indicates fake images (corresponding to the $G_w(x^{(i)})$'s) and valid
denotes an all-1's vector with one entry per image in gen_imgs.


TensorFlow: Discriminator loss Recall the batch version of the optimization 
problem: 


$$\max_{\theta} \frac{1}{m_B} \sum_{i \in B} \log D_\theta(y^{(i)}) + \log\left(1 - D_\theta(G_w(x^{(i)}))\right).$$


Taking the minus sign in the objective, we obtain the equivalent minimization 
optimization: 


$$\min_{\theta} \frac{1}{m_B} \sum_{i \in B} \underbrace{-\log D_\theta(y^{(i)}) - \log\left(1 - D_\theta(G_w(x^{(i)}))\right)}_{\text{discriminator loss}}$$

Figure 3.39. Léon Bottou is the inventor of Wasserstein GANs. He is another big figure
in the AI field.


where the discriminator loss is defined as the minus version. Using (3.152) 
and (3.153), we can implement the discriminator loss as: 
real_loss = CE_loss(valid, discriminator(real_imgs)) 


fake_loss = CE_loss(fake, discriminator(gen_imgs)) 
d_loss = real_loss + fake_loss 


where real_imgs indicates real images (corresponding to the $y^{(i)}$'s) and fake denotes an
all-0's vector with one entry per image in gen_imgs.


TensorFlow: Training Using all of the above, one can implement a code for 
training. Here we omit details. But you will have a chance to be guided in detail in 
Prob 9.4. 


Look ahead We have investigated the GAN optimization problem together with 
its TensorFlow implementation. While GANs work well in practice, there is one 
critical issue which we did not delve into. In fact, the issue can be understood 
from the equivalent form of the GAN optimization that we derived in the previous 
section: 


$$G^*_{\text{GAN}} = \arg\min_{G(\cdot)} \mathrm{JSD}(Q_Y, Q_{\tilde{Y}}). \tag{3.155}$$


The issue was identified by the team of Professor Léon Bottou, a computer
scientist as well as a mathematician. See Fig. 3.39 for his portrait. With his strong
background in math and statistics, he understood that the critical issue comes from an
undesirable property of JSD. More importantly, he knew how to address the issue.
In the course of addressing it, he developed a variant of GAN, which he
called (Arjovsky et al., 2017):


Wasserstein GAN. 


In the next section, we will figure out what that critical issue is. We will then inves- 
tigate how Bottou came up with Wasserstein GAN. 
Problem Set 9


Prob 9.1 (Jensen-Shannon divergence) Recall the Jensen-Shannon diver- 
gence that we encountered in Section 3.7. Let p and q be two distributions. 


(a) Show that
$$\mathrm{JSD}(p, q) = \mathrm{JSD}(q, p). \tag{3.156}$$

(b) Show that
$$\mathrm{JSD}(p, q) \geq 0. \tag{3.157}$$


Also identify conditions under which the equality in (3.157) holds. 
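
Prob 9.2 (Proxy for generator loss in GAN) Consider the optimization problem
for GAN that we learned in Section 3.7:

$$\min_{G(\cdot) \in \mathcal{N}} \max_{D(\cdot) \in \mathcal{N}} \frac{1}{m} \sum_{i=1}^{m} \log D(y^{(i)}) + \log\left(1 - D(\tilde{y}^{(i)})\right) \tag{3.158}$$

where $\mathcal{N}$ indicates a neural network, and $y^{(i)}$ and $\tilde{y}^{(i)} := G(x^{(i)})$ denote real and
fake samples respectively. Here $x^{(i)}$ denotes an input to the generator that one can
synthesize arbitrarily, and $m$ is the number of examples. Suppose that the inner
optimization is solved to yield $D^*(\cdot)$. Then, the optimization problem reduces to:

$$\min_{G(\cdot) \in \mathcal{N}} \frac{1}{m} \sum_{i=1}^{m} \log D^*(y^{(i)}) + \log\left(1 - D^*(\tilde{y}^{(i)})\right). \tag{3.159}$$

(a) Show that the optimization problem (3.159) is equivalent to:

$$\min_{G(\cdot) \in \mathcal{N}} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D^*(\tilde{y}^{(i)})\right). \tag{3.160}$$

(b) Let $w$ be the weights of the generator model. Show that

$$\frac{d \log\left(1 - D^*(\tilde{y}^{(i)})\right)}{dw} = \frac{-1}{1 - D^*(\tilde{y}^{(i)})} \frac{dD^*(\tilde{y}^{(i)})}{d\tilde{y}^{(i)}} \frac{d\tilde{y}^{(i)}}{dw}; \tag{3.161}$$

$$\frac{d\left(-\log D^*(\tilde{y}^{(i)})\right)}{dw} = \frac{-1}{D^*(\tilde{y}^{(i)})} \frac{dD^*(\tilde{y}^{(i)})}{d\tilde{y}^{(i)}} \frac{d\tilde{y}^{(i)}}{dw}. \tag{3.162}$$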

(c) Suppose that the discriminator works almost optimally, i.e., $D^*(\tilde{y}^{(i)})$ is very
close to 0. Which is larger in magnitude between (3.161) and (3.162)?
Instead of solving (3.160), many researchers including the inventors of
GAN prefer to solve the following for $G(\cdot)$:

$$\min_{G(\cdot) \in \mathcal{N}} \frac{1}{m} \sum_{i=1}^{m} -\log D^*(\tilde{y}^{(i)}). \tag{3.163}$$

Explain the rationale behind this alternative.


Prob 9.3 (Batch normalization (Ioffe and Szegedy, 2015)) Consider a
deep neural network. Let $z^{(i)} := [z^{(i)}_1, \dots, z^{(i)}_n]^T$ be the output of a hidden layer
prior to activation for the $i$th example where $i \in \{1, 2, \dots, m\}$ and $m$ is the number
of examples. Here $n$ denotes the number of neurons in the hidden layer.

(a) Let

$$\mu = \frac{1}{m} \sum_{i=1}^{m} z^{(i)}, \qquad \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} \left(z^{(i)} - \mu\right)^2 \tag{3.164}$$

where $(\cdot)^2$ indicates a component-wise square, and hence $\sigma^2 \in \mathbb{R}^n$. Consider

$$z^{(i)}_{\text{norm}} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}} \tag{3.165}$$

$$\tilde{z}^{(i)} = \gamma z^{(i)}_{\text{norm}} + \beta \tag{3.166}$$

where $\gamma, \beta \in \mathbb{R}^n$. Again the division and multiplication are all component-wise.
Here $\epsilon$ is a tiny value introduced to avoid division by 0 (typically
$10^{-5}$). This is called a smoothing term. Assuming that $\epsilon$ is negligible and
the $z^{(i)}$'s are independent over $i$, what are the mean and variance of $\tilde{z}^{(i)}$?

(b) Many researchers often employ $\tilde{z}^{(i)}$ instead of $z^{(i)}$ during training. These
operations include zero-centering and normalization (hence it is named
batch normalization), followed by rescaling and shifting with two new
parameters ($\gamma$ and $\beta$) which are learnable via training. In other words, these
operations let the model learn the optimal scale and mean of the inputs for
each layer. It turns out this technique plays a significant role in stabilizing
and speeding up training, especially for a very deep neural network. This has
been verified experimentally by many practitioners, but no clear theoretical
justification has been provided thus far.
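In practice, this operation is done over the current mini-batch, so the
whole procedure is summarized as follows: for the current mini-batch $B$
with the size $m_B$,

$$\begin{aligned} \mu_B &= \frac{1}{m_B} \sum_{i \in B} z^{(i)}, \qquad \sigma_B^2 = \frac{1}{m_B} \sum_{i \in B} \left(z^{(i)} - \mu_B\right)^2, \\ z^{(i)}_{\text{norm}} &= \frac{z^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad \tilde{z}^{(i)} = \gamma z^{(i)}_{\text{norm}} + \beta. \end{aligned} \tag{3.167}$$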


At test time, there is no mini-batch to compute the empirical mean and 
standard deviation. What can we do then? Suggest a way to handle this 
issue and also explain the rationale behind your suggestion. You may want 
to consult with some well-known literature if you wish. 


Prob 9.4 (TensorFlow implementation of GAN) Consider the GAN that 
we learned in Section 3.7. In this problem, you are asked to build a simple GAN 
that generates MNIST style handwritten digit images. We employ a 5-layer neural 
network for generator with ReLU in all the hidden layers and logistic activation in 
the output layer. 


(a) (MNIST dataset loading) Using the following script (or otherwise), load the
MNIST dataset:


from tensorflow.keras.datasets import mnist 
(X_train,y_train),(X_test, y_test)=mnist.load_data() 
X_train = X_train/255. 

X_test = X_test/255. 


Explain the role of the following script: 


import numpy as np

def get_batches(data, batch_size):
    batches = []
    for i in range(int(data.shape[0] // batch_size)):
        batch = data[i*batch_size:(i+1)*batch_size]
        batches.append(batch)
    return np.asarray(batches)


(b) (Data visualization) Assume that the code in part (a) is executed. Using a 
skeleton code provided in Prob 8.10(b), write a script that plots 60 images 
in the first batch of X_train in one figure. Also plot the figure. 

(c) (Generator) Draw a block diagram for generator implemented by the
following:


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import ReLU

latent_dim = 100

generator = Sequential()
generator.add(Dense(128, input_dim=latent_dim))
generator.add(BatchNormalization())
generator.add(ReLU())
generator.add(Dense(256))
generator.add(BatchNormalization())
generator.add(ReLU())
generator.add(Dense(512))
generator.add(BatchNormalization())
generator.add(ReLU())
generator.add(Dense(1024))
generator.add(BatchNormalization())
generator.add(ReLU())
generator.add(Dense(28*28, activation='sigmoid'))


(d) (Generator check) Upon the above codes being executed, report an output 
for the following: 


from tensorflow.random import normal
import matplotlib.pyplot as plt

batch_size = 64
x = normal([batch_size, latent_dim])
gen_imgs = generator.predict(x)
gen_imgs = gen_imgs.reshape(-1, 28, 28)

num_of_images = 60
for index in range(1, num_of_images + 1):
    plt.subplot(6, 10, index)
    plt.axis('off')
    plt.imshow(gen_imgs[index], cmap='gray_r')


(e) (Discriminator) Draw a block diagram for discriminator implemented by 
the following: 

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import ReLU

discriminator = Sequential()
discriminator.add(Dense(512, input_shape=(784,)))
discriminator.add(ReLU())
discriminator.add(Dense(256))
discriminator.add(ReLU())
discriminator.add(Dense(1, activation='sigmoid'))


(f) (Training) Suppose we create the generator and discriminator as follows: 


from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

adam = Adam(learning_rate=0.0002, beta_1=0.5)

# discriminator compile
discriminator.compile(loss='binary_crossentropy', optimizer=adam)

# fix disc's weights while training generator
discriminator.trainable = False

# define GAN with fake input and disc. output
gan_input = Input(shape=(latent_dim,))
x = generator(gan_input)
output = discriminator(x)
gan = Model(gan_input, output)
gan.compile(loss='binary_crossentropy', optimizer=adam)


where generator and discriminator are the models designed in parts (c)
and (e), respectively.
Now explain how generator and discriminator are trained in the following
code:

import numpy as np
from tensorflow.random import normal

EPOCHS = 50
k = 2  # k:1 alternating gradient descent
d_losses = []
g_losses = []

for epoch in range(1, EPOCHS + 1):
    # train per each batch
    np.random.shuffle(X_train)
    for i, real_imgs in enumerate(get_batches(X_train, batch_size)):
        #####################
        # train discriminator
        #####################
        # fake input generation
        gen_input = normal([batch_size, latent_dim])
        # fake images
        gen_imgs = generator.predict(gen_input)
        real_imgs = real_imgs.reshape(-1, 28*28)
        # input for discriminator
        d_input = np.concatenate([real_imgs, gen_imgs])
        # label for discriminator
        # (first half: real (1); second half: fake (0))
        d_label = np.zeros(2*batch_size)
        d_label[:batch_size] = 1
        # train discriminator
        d_loss = discriminator.train_on_batch(d_input, d_label)

        #####################
        # train generator
        #####################
        if i % k:  # skip the generator update when i is a multiple of k
            # fake input generation
            g_input = normal([batch_size, latent_dim])
            # label for fake image
            # Generator wants fake images to be treated
            # as real ones
            g_label = np.ones(batch_size)
            # train generator
            g_loss = gan.train_on_batch(g_input, g_label)

    # record the last losses of the epoch
    d_losses.append(d_loss)
    g_losses.append(g_loss)


(g) (Training check) For epoch= 10, 30, 50, 70, 90: plot a figure that shows 25 
fake images from generator trained in part (f) or by other methods of 
yours. Also plot the generator loss and discriminator loss as a function of 
epochs. Include Python scripts as well. 


Prob 9.5 (Minimax theorem) Let $f(x,y)$ be a continuous real-valued function
defined on $\mathcal{X} \times \mathcal{Y}$ such that


(i) $f(x,y)$ is convex in $x \in \mathcal{X}$ $\forall y \in \mathcal{Y}$; and
(ii) $f(x,y)$ is concave in $y \in \mathcal{Y}$ $\forall x \in \mathcal{X}$


where $\mathcal{X}$ and $\mathcal{Y}$ are convex and compact sets.
Note: You do not need to solve the optional problems below.


(a) Show that

$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x,y) \geq \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} f(x,y). \tag{3.168}$$

Does (3.168) hold also for any arbitrary function $f(\cdot,\cdot)$?
(b) Suppose


i ; < i y). wl 
ae es a) = sen) (3.169) 
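Then, argue that (3.169) implies:

$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x,y) \leq \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} f(x,y). \tag{3.170}$$

(c) Suppose that $a < \min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x,y)$. Then, show that there are finite
$y_1, \dots, y_n \in \mathcal{Y}$ such that

$$a < \min_{x \in \mathcal{X}} \max_{y \in \{y_1, \dots, y_n\}} f(x,y). \tag{3.171}$$

(d) (Optional) Suppose that $a < \min_{x \in \mathcal{X}} \max_{y \in \{y_1, y_2\}} f(x,y)$ for any $y_1, y_2 \in \mathcal{Y}$.
Then, show that there exists $y_0 \in \mathcal{Y}$ such that

$$a < \min_{x \in \mathcal{X}} f(x, y_0). \tag{3.172}$$

(e) (Optional) Suppose that $a < \min_{x \in \mathcal{X}} \max_{y \in \{y_1, \dots, y_n\}} f(x,y)$ for any finite
$y_1, \dots, y_n \in \mathcal{Y}$. Then, show that there exists $y_0 \in \mathcal{Y}$ such that

$$a < \min_{x \in \mathcal{X}} f(x, y_0). \tag{3.173}$$

Hint: Use proof by induction and part (d).


Note: (3.173) implies that $a < \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} f(x,y)$. This together with the
results in parts (b) and (c) proves (3.170). Combining this with (3.168) proves the
minimax theorem:

$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x,y) = \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} f(x,y). \tag{3.174}$$
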
Prob 9.6 (Training instability) Consider a function: 


f(x,y) = (2 + cos x)(2 + cos y) (3.175) 


where x,y € R. 

(a) Solve the following optimization (i.e., find the optimal value as well as
the points that achieve it):

$$\min_{x} \max_{y} f(x,y). \tag{3.176}$$

(b) Solve the reverse version of the optimization:

$$\max_{y} \min_{x} f(x,y). \tag{3.177}$$

(c) Suppose that we perform 1:1 alternating gradient descent for $f(x,y)$ with
an initial point $(x^{(0)}, y^{(0)}) = (\pi + 0.1, -0.1)$. Plot $f(x^{(t)}, y^{(t)})$ as a function
of $t$ where $(x^{(t)}, y^{(t)})$ denotes the estimate at the $t$th iteration. What are the
limiting values of $(x^{(t)}, y^{(t)})$? Also explain why.

Note: You may want to set the learning rates properly so that the convergence
behaviour is clear.

(d) Redo part (c) with a different initial point $(x^{(0)}, y^{(0)}) = (0.1, -0.1)$.


Prob 9.7 (Alternating gradient descent) Consider a function: 


$$f(x,y) = x^2 - y^2 \tag{3.178}$$


where $x, y \in \mathbb{R}$.


(a) Solve the following optimization:

$$\min_{x} \max_{y} f(x,y). \tag{3.179}$$

(b) Suppose that we perform 1:1 alternating gradient descent for $f(x,y)$ with
an initial point $(x^{(0)}, y^{(0)}) = (1, 1)$. Plot $f(x^{(t)}, y^{(t)})$ as a function of $t$ where
$(x^{(t)}, y^{(t)})$ denotes the estimate at the $t$th iteration. What are the limiting
values of $(x^{(t)}, y^{(t)})$? Also explain why.

Note: You may want to set the learning rates properly so that the convergence
behaviour is clear.

(c) Redo part (b) with a different initial point $(x^{(0)}, y^{(0)}) = (-1, -1)$.


Prob 9.8 (True or False?) 


(a) Consider the following optimization: 


$$\min_{x \in \mathbb{R}} \max_{y \in \mathbb{R}} x^2 - y^2.$$

With 1:1 alternating gradient descent with a proper choice of the learning
rates, one can achieve the optimal solution to the above.

(b) Consider the following optimization:

$$\min_{x \in \mathbb{R}} \max_{y \in \mathbb{R}} (2 + \cos x)(2 + \cos y).$$

Suppose we perform 1:1 alternating gradient descent with a proper choice 
of the learning rates. Then, the converging points can be distinct depending 
on different initial points. 
(c) Autoencoder can be categorized as a feature learning method. 

3.9 Wasserstein GAN I


Recap During a couple of past sections, we formulated the optimization problem
for GANs. Given data $\{y^{(i)}\}_{i=1}^{m}$:

$$\min_{G(\cdot) \in \mathcal{N}} \max_{D(\cdot) \in \mathcal{N}} \frac{1}{m} \sum_{i=1}^{m} \log D(y^{(i)}) + \log\left(1 - D(\tilde{y}^{(i)})\right) \tag{3.180}$$


where $G(\cdot)$ and $D(\cdot)$ indicate the functions of the generator and the discriminator,
respectively; and $\mathcal{N}$ denotes a set of DNN-based functions. We then showed that
the GAN optimization belongs to a generic divergence-based optimization problem
(an age-old problem in statistics), assuming that DNNs can represent any arbitrary
function.

At the end of the previous section, however, we claimed that there is a critical
issue in the GAN optimization, and this is what Léon Bottou figured out. We also
claimed that in the course of addressing the issue, Bottou came up with a variant
of GANs, which he named Wasserstein GAN (WGAN for short).


Outline Supporting these claims forms the content of this section. We will cover
the following three topics. First, we will figure out what the critical issue is. We will
then investigate how Bottou addressed the issue. Lastly we will discuss how Bottou's
approach leads to the optimization problem for WGAN.


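What is the critical issue that arises in the GAN optimization? Recall in
the GAN optimization that the optimal generator $G^*(\cdot)$ reads:

$$G^* = \arg\min_{G(\cdot)} \mathrm{JSD}(Q_Y, Q_{\tilde{Y}}) \tag{3.181}$$

where $Q_Y$ and $Q_{\tilde{Y}}$ indicate the empirical distributions of real samples
$Y \in \{y^{(1)}, \dots, y^{(m)}\} =: \mathcal{Y}$ and fake samples $\tilde{Y} \in \{\tilde{y}^{(1)}, \dots, \tilde{y}^{(m)}\} =: \tilde{\mathcal{Y}}$ respectively,
and

$$\mathrm{JSD}(Q_Y, Q_{\tilde{Y}}) = \frac{1}{2} \sum_{z} Q_Y(z) \log \frac{Q_Y(z)}{\frac{Q_Y(z) + Q_{\tilde{Y}}(z)}{2}} + \frac{1}{2} \sum_{z} Q_{\tilde{Y}}(z) \log \frac{Q_{\tilde{Y}}(z)}{\frac{Q_Y(z) + Q_{\tilde{Y}}(z)}{2}}. \tag{3.182}$$

Here $z$ indicates a dummy variable which takes an element either from $\mathcal{Y}$ or from
$\tilde{\mathcal{Y}}$. One key observation is that in almost all practically-relevant settings, fake and
real samples are different from each other:

$$\mathcal{Y} \cap \tilde{\mathcal{Y}} = \varnothing. \tag{3.183}$$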
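Why? Consider an image data setting in which the dimension of data is usually very
large, e.g., the dimension of an image in ImageNet is $227 \times 227 \times 3 = 154{,}587$.
In this setting, the probability that any fake image is exactly the same as one of the
real images is almost 0, so it is a measure-zero event. The disjointness (3.183), together with the
definitions of $Q_Y$ and $Q_{\tilde{Y}}$ (that we established in Section 3.7), then yields:

$$\begin{aligned} \frac{Q_Y(z)}{\frac{Q_Y(z) + Q_{\tilde{Y}}(z)}{2}} &= \begin{cases} \dfrac{\frac{1}{m}}{\frac{\frac{1}{m} + 0}{2}} = 2, & \text{if } z \in \mathcal{Y}; \\[2ex] \dfrac{0}{\frac{0 + \frac{1}{m}}{2}} = 0, & \text{if } z \in \tilde{\mathcal{Y}}; \end{cases} \\ \frac{Q_{\tilde{Y}}(z)}{\frac{Q_Y(z) + Q_{\tilde{Y}}(z)}{2}} &= \begin{cases} \dfrac{0}{\frac{\frac{1}{m} + 0}{2}} = 0, & \text{if } z \in \mathcal{Y}; \\[2ex] \dfrac{\frac{1}{m}}{\frac{0 + \frac{1}{m}}{2}} = 2, & \text{if } z \in \tilde{\mathcal{Y}}. \end{cases} \end{aligned} \tag{3.184}$$

Plugging this into (3.182), and using the convention $0 \log 0 = 0$, we get:

$$\begin{aligned} \mathrm{JSD}(Q_Y, Q_{\tilde{Y}}) &= \frac{1}{2} \sum_{z} Q_Y(z) \log \frac{Q_Y(z)}{\frac{Q_Y(z) + Q_{\tilde{Y}}(z)}{2}} + \frac{1}{2} \sum_{z} Q_{\tilde{Y}}(z) \log \frac{Q_{\tilde{Y}}(z)}{\frac{Q_Y(z) + Q_{\tilde{Y}}(z)}{2}} \\ &= \frac{1}{2} \sum_{z \in \mathcal{Y}} Q_Y(z) \log 2 + \frac{1}{2} \sum_{z \in \tilde{\mathcal{Y}}} Q_{\tilde{Y}}(z) \log 2 \\ &= \log 2 \end{aligned} \tag{3.185}$$

where the last equality comes from $\sum_{z \in \mathcal{Y}} Q_Y(z) = 1$ and $\sum_{z \in \tilde{\mathcal{Y}}} Q_{\tilde{Y}}(z) = 1$.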


From (3.185), we can now see the critical issue:
$\mathrm{JSD}(Q_Y, Q_{\tilde{Y}})$ does not depend on how we choose $G(\cdot)$,
meaning that

$$G^* = \arg\min_{G(\cdot)} \mathrm{JSD}(Q_Y, Q_{\tilde{Y}}) = \arg\min_{G(\cdot)} \log 2 \quad \text{could be anything.} \tag{3.186}$$

This implies that we may arrive at a useless solution from the JSD-based
optimization (3.181), since any $G(\cdot)$ can be optimal.
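The conclusion of (3.185) is easy to confirm numerically. The following sketch evaluates the JSD formula (3.182) for small empirical distributions stored as dictionaries; the function name jsd and the sample points are illustrative assumptions. Whether the fake samples sit right next to the real ones or far away from them, the value is log 2 as long as the supports are disjoint.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence between two dicts mapping point -> probability."""
    total = 0.0
    for z in set(p) | set(q):
        pz, qz = p.get(z, 0.0), q.get(z, 0.0)
        mid = 0.5 * (pz + qz)
        # Terms with zero probability contribute nothing (0 * log 0 = 0).
        if pz > 0:
            total += 0.5 * pz * math.log(pz / mid)
        if qz > 0:
            total += 0.5 * qz * math.log(qz / mid)
    return total

real = {0.0: 0.5, 1.0: 0.5}
fake_near = {0.01: 0.5, 1.01: 0.5}   # almost on top of the real samples
fake_far = {50.0: 0.5, 90.0: 0.5}    # far away from the real samples

jsd_near = jsd(real, fake_near)
jsd_far = jsd(real, fake_far)
```

Both jsd_near and jsd_far come out as log 2, so the JSD gives the generator no signal about which set of fake samples is better.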

Here you may see that something weird is happening. Why? We already knew
that GANs work well in practice. This suggests that the phenomena observed
by many researchers look inconsistent with the theory implied by the above simple
derivation (3.186). Is there a mistake in the above derivation? Or something wrong in
simulations done by many practitioners? Or something else? It turns out the answer
is "something else". Remember in the GAN optimization (3.180) that $G(\cdot)$ and
$D(\cdot)$ should be DNN-based functions, not arbitrary functions. So precisely speaking,
the optimal generator should read:

$$G^*_{\text{GAN}} = \arg\min_{G(\cdot) \in \mathcal{N}} \max_{D(\cdot) \in \mathcal{N}} \frac{1}{m} \sum_{i=1}^{m} \log D(y^{(i)}) + \log\left(1 - D(\tilde{y}^{(i)})\right).$$

In practice, a DNN is not perfectly expressive, and hence:

$$G^*_{\text{GAN}} \neq G^*.$$

This is the reason why there is inconsistency between the theory and the practice.
There are some groups of people (including Prof. Sanjeev Arora at Princeton, a
theoretical computer scientist) who have been investigating why GANs with
DNN-function constraints lead to good performance (Arora et al., 2017). Nonetheless,
we have no clear understanding of this yet.


Motivation for the use of the Wasserstein distance The critical issue, reflected
in (3.186), motivated Bottou to reconsider the generic divergence-based optimization
problem:

$$\min_{G(\cdot)} D(Q_Y, Q_{\tilde{Y}}) \tag{3.187}$$

where $D(\cdot,\cdot)$ is of our design choice. He knew that there are some good divergence
measures which do not yield the critical issue (3.186). One of the measures that he
chose was the 1st-order Wasserstein distance that we studied in Part I. This led him
to obtain:

$$\min_{G(\cdot)} W(Q_Y, Q_{\tilde{Y}}) \tag{3.188}$$

where

$$\begin{aligned} W(Q_Y, Q_{\tilde{Y}}) &= \min_{Q_{Y,\tilde{Y}}} \mathbb{E}\left[\|Y - \tilde{Y}\|\right] \\ &= \min_{Q_{Y,\tilde{Y}}} \sum_{i=1}^{m} \sum_{j=1}^{m} Q_{Y,\tilde{Y}}(y^{(i)}, \tilde{y}^{(j)}) \, \|y^{(i)} - \tilde{y}^{(j)}\|. \end{aligned} \tag{3.189}$$

Notice that $\|y^{(i)} - \tilde{y}^{(j)}\|$ placed inside the double summation
depends on the values of $\{\tilde{y}^{(j)}\}_{j=1}^{m}$ themselves. Hence, we can readily see that the
objective is indeed a function of $G(\cdot)$ which directly controls $\{\tilde{y}^{(j)}\}_{j=1}^{m}$.
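To see this sensitivity in numbers, consider the special case of scalar samples. For two equal-size, equally weighted sets of 1-D samples, an optimal coupling in (3.189) matches the samples in sorted order, so the distance can be computed without solving an LP. The sketch below relies on that 1-D shortcut (it does not extend to images), and the name wasserstein_1d and the sample values are illustrative assumptions.

```python
def wasserstein_1d(real, fake):
    """W1 between two equal-size, equally weighted sets of 1-D samples."""
    assert len(real) == len(fake)
    # In one dimension, matching sorted samples is an optimal coupling.
    pairs = zip(sorted(real), sorted(fake))
    return sum(abs(y - y_tilde) for y, y_tilde in pairs) / len(real)

real = [0.0, 1.0]
w_near = wasserstein_1d(real, [0.5, 1.5])   # fake samples close to the real ones
w_far = wasserstein_1d(real, [2.0, 3.0])    # fake samples farther away
```

Here w_near is 0.5 and w_far is 2.0: unlike the JSD, the Wasserstein distance shrinks as the fake samples move toward the real ones, which is precisely the signal a generator needs.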
How to solve the Wasserstein-distance-based optimization? Replacing
$D(\cdot,\cdot)$ with the Wasserstein distance in (3.187), we can rewrite the optimization
problem (3.188) as:

$$\begin{aligned} \min_{G(\cdot)} \min_{Q_{Y,\tilde{Y}}} \; & \sum_{i=1}^{m} \sum_{j=1}^{m} Q_{Y,\tilde{Y}}(y^{(i)}, \tilde{y}^{(j)}) \, \|y^{(i)} - \tilde{y}^{(j)}\| : \\ & \sum_{j=1}^{m} Q_{Y,\tilde{Y}}(y^{(i)}, \tilde{y}^{(j)}) = Q_Y(y^{(i)}), \quad i \in \{1, \dots, m\}; \\ & \sum_{i=1}^{m} Q_{Y,\tilde{Y}}(y^{(i)}, \tilde{y}^{(j)}) = Q_{\tilde{Y}}(\tilde{y}^{(j)}), \quad j \in \{1, \dots, m\}. \end{aligned} \tag{3.190}$$

Consider the inner optimization problem in (3.190). This is a problem
that we are familiar with: it is an LP, and we know how to solve an LP using
the simplex algorithm. Then, no problem? Unfortunately, that is not the case.
We encounter two major issues in solving the above problem.

First, there is no closed-form solution for an LP, so this poses a challenge in finding
$G^*(\cdot)$ in the end. The second issue is a more critical one. Notice in practice
that the number of optimization variables in the inner problem, namely the $m^2$ values
$Q_{Y,\tilde{Y}}(y^{(i)}, \tilde{y}^{(j)})$, is huge. In the big data era, $m$ is typically on the order of
thousands, millions, or even billions. So $m^2$ is typically a huge number. Even if we
use a fast algorithm, like the simplex algorithm, it would take a long time. So it is
computationally expensive.

Bottou recognized the issues. More importantly, he knew how to address them.
The idea is to rely on the father of LP: Kantorovich. In fact, Kantorovich already
established the strong duality theorem for the Wasserstein-distance-based LP, called
Kantorovich duality or Kantorovich-Rubinstein duality (Villani, 2009). What
Kantorovich showed is that the dual problem of the Wasserstein-distance-based
LP yields exactly the same solution as that of the primal problem,^7 and more
importantly, the dual problem is computationally much more efficient. So Bottou
simply applied Kantorovich duality to come up with an optimization problem, which
is now known as the WGAN optimization. We will describe Kantorovich duality
in detail below.


7. This is what we already know. But at that time, the strong duality theorem for convex optimization was
not established yet. In fact, Kantorovich duality formed the basis of the strong duality theorem for generic
convex problems.
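
Notational simplification Let us consider the inner optimization problem
in (3.190). For notational simplification, we employ the dummy variable $z$ that we
introduced earlier: $z \in \mathcal{Y} \cup \tilde{\mathcal{Y}}$. Let us use $z$ and $\hat{z}$ to indicate $y$ and $\tilde{y}$,
respectively. Using this notation, we can then rewrite the inner optimization problem as:

$$\begin{aligned} \min_{Q_{Y,\tilde{Y}}} \; & \sum_{z \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} \sum_{\hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} Q_{Y,\tilde{Y}}(z, \hat{z}) \, \|z - \hat{z}\| : \\ & \sum_{\hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} Q_{Y,\tilde{Y}}(z, \hat{z}) = Q_Y(z) \quad \forall z \in \mathcal{Y} \cup \tilde{\mathcal{Y}}; \\ & \sum_{z \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} Q_{Y,\tilde{Y}}(z, \hat{z}) = Q_{\tilde{Y}}(\hat{z}) \quad \forall \hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}}. \end{aligned} \tag{3.191}$$

For further notational simplification, let

$$x(z, \hat{z}) := Q_{Y,\tilde{Y}}(z, \hat{z}) \geq 0. \tag{3.192}$$

Then, the problem can be rewritten as:

$$\begin{aligned} \min_{x} \; & \sum_{z \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} \sum_{\hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} \|z - \hat{z}\| \, x(z, \hat{z}) : \\ & -x(z, \hat{z}) \leq 0 \quad \forall z, \hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}}; \\ & \sum_{\hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} x(z, \hat{z}) = Q_Y(z) \quad \forall z \in \mathcal{Y} \cup \tilde{\mathcal{Y}}; \\ & \sum_{z \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} x(z, \hat{z}) = Q_{\tilde{Y}}(\hat{z}) \quad \forall \hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}}. \end{aligned} \tag{3.193}$$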

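Lagrange function, dual function & dual problem In an effort to derive
the dual problem, we first consider the Lagrange function:

$$\begin{aligned} \mathcal{L}(x, \lambda, \nu, \mu) &= \sum_{z} \sum_{\hat{z}} \|z - \hat{z}\| \, x(z, \hat{z}) - \sum_{z} \sum_{\hat{z}} \lambda(z, \hat{z}) \, x(z, \hat{z}) \\ &\quad + \sum_{z} \nu(z) \Big( Q_Y(z) - \sum_{\hat{z}} x(z, \hat{z}) \Big) + \sum_{\hat{z}} \mu(\hat{z}) \Big( Q_{\tilde{Y}}(\hat{z}) - \sum_{z} x(z, \hat{z}) \Big) \\ &= \sum_{z} \sum_{\hat{z}} \big( \|z - \hat{z}\| - \lambda(z, \hat{z}) - \nu(z) - \mu(\hat{z}) \big) \, x(z, \hat{z}) \\ &\quad + \sum_{z} \nu(z) Q_Y(z) + \sum_{\hat{z}} \mu(\hat{z}) Q_{\tilde{Y}}(\hat{z}) \end{aligned} \tag{3.194}$$

where every sum runs over $\mathcal{Y} \cup \tilde{\mathcal{Y}}$.
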


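where $(\lambda, \nu, \mu)$ are Lagrange multipliers. Notice that the multiplication factors
associated with the $x(z, \hat{z})$'s in the double summation term in the last line of (3.194)
should be zeros:

$$\|z - \hat{z}\| - \lambda(z, \hat{z}) - \nu(z) - \mu(\hat{z}) = 0 \quad \forall z, \hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}}. \tag{3.195}$$

Otherwise, one can set $x(z, \hat{z}) = \infty$ (or $-\infty$) depending on the sign of a
non-zero such factor while setting $x(z, \hat{z}) = 0$ for the other terms. This then yields
$\mathcal{L}(x, \lambda, \nu, \mu) = -\infty$, and hence $g(\lambda, \nu, \mu) = -\infty$. Obviously this is not a case
of interest. Hence, applying (3.195), we derive the dual function as:

$$g(\lambda, \nu, \mu) = \sum_{z \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} \nu(z) Q_Y(z) + \sum_{\hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} \mu(\hat{z}) Q_{\tilde{Y}}(\hat{z}). \tag{3.196}$$

Now notice that $\lambda(z, \hat{z}) \geq 0$ is a constraint that appears in the dual problem. This
together with (3.195) then yields:

$$\|z - \hat{z}\| - \nu(z) - \mu(\hat{z}) = \lambda(z, \hat{z}) \geq 0 \quad \forall z, \hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}}. \tag{3.197}$$

Using this, we can formulate the dual problem as:

$$\begin{aligned} d^* := \max_{\nu, \mu} \; & \sum_{z \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} \nu(z) Q_Y(z) + \sum_{\hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} \mu(\hat{z}) Q_{\tilde{Y}}(\hat{z}) : \\ & \nu(z) + \mu(\hat{z}) \leq \|z - \hat{z}\| \quad \forall z, \hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}}. \end{aligned} \tag{3.198}$$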


How to deal with two functions in the optimization? How to solve the dual
problem (3.198)? Actually it is not that simple. One may think of the following
naive approach: searching over all the possible functions $\nu(\cdot)$ and $\mu(\cdot)$ to find
the maximum.

Kantorovich did not take this approach. Instead he came up with a very interesting
and smart idea. The idea is to translate the problem (3.198) with two functions
($\nu(\cdot)$ and $\mu(\cdot)$) that one can control, into an equivalent problem but with only
one function, say $\psi(\cdot)$. It turns out the idea led Kantorovich to come up with the
following equivalent problem:

$$\begin{aligned} d^{**} := \max_{\psi} \; & \sum_{z \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} Q_Y(z) \psi(z) - \sum_{\hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} Q_{\tilde{Y}}(\hat{z}) \psi(\hat{z}) : \\ & |\psi(z) - \psi(\hat{z})| \leq \|z - \hat{z}\| \quad \forall z, \hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}}. \end{aligned} \tag{3.199}$$

Notice in the translated problem that we have only one function to optimize over,
so the complexity is significantly reduced.
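A tiny example helps to read (3.199). Take scalar real samples {0, 1} and fake samples {2, 3}, each with probability 1/2. Every coupling moves mass from a real point to a fake point lying to its right, so the primal cost in (3.189) equals E[Ỹ] − E[Y] = 2 no matter which coupling is used. The sketch below checks that the hand-picked function ψ(z) = −z satisfies the constraint of (3.199) and attains the same value in the dual objective; the helper names and the choice of ψ are illustrative assumptions.

```python
def dual_objective(psi, real, fake):
    """Objective of (3.199) for uniform empirical distributions over 1-D samples."""
    real_part = sum(psi(y) for y in real) / len(real)
    fake_part = sum(psi(y) for y in fake) / len(fake)
    return real_part - fake_part

def is_1_lipschitz(psi, points, tol=1e-12):
    """Check |psi(a) - psi(b)| <= |a - b| on every pair of the given points."""
    return all(abs(psi(a) - psi(b)) <= abs(a - b) + tol
               for a in points for b in points)

real, fake = [0.0, 1.0], [2.0, 3.0]

def psi(z):
    return -z   # hand-picked feasible function for this toy example

feasible = is_1_lipschitz(psi, real + fake)
value = dual_objective(psi, real, fake)
```

The dual value is 2.0, which matches the primal cost, in line with the strong duality that Kantorovich established. Only a single function of z had to be chosen, instead of a table of m² coupling values.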

Relying on the proof of d* = d** together with the outer optimization problem 
(w.r.t. G(-)), Bottou was able to derive a simpler form of an optimization problem, 
which is now known as the WGAN optimization. 


Look ahead In the next section, we will prove that d* is indeed d**. We will 
then demonstrate that this proof leads to the WGAN optimization. 

3.10 Wasserstein GAN II


Recap In the previous section, we figured out there is a critical issue in GANs:
$\mathrm{JSD}(Q_Y, Q_{\tilde{Y}})$ does not depend on $G(\cdot)$, which in turn suggests that the optimal $G^*$
could be anything; this is definitely not what we want. In an effort to address this
issue, we considered the 1st-order Wasserstein distance which does not have the
undesirable property:

$$\min_{G(\cdot)} W(Q_Y, Q_{\tilde{Y}}) \tag{3.200}$$

where $Q_Y$ and $Q_{\tilde{Y}}$ indicate the empirical distributions of real samples
$Y \in \{y^{(1)}, \dots, y^{(m)}\} =: \mathcal{Y}$ and fake samples $\tilde{Y} \in \{\tilde{y}^{(1)}, \dots, \tilde{y}^{(m)}\} =: \tilde{\mathcal{Y}}$, respectively.
We then checked that the objective function in (3.200) is a sensitive function of
$G(\cdot)$.

Since the inner optimization in (3.200) involves so many optimization variables
(whose number scales as $m^2$) and hence is not computationally tractable (which is
often the case in the current big data era), we started considering the dual problem
which is known to be computationally tractable due to Kantorovich duality. Applying
standard dual-problem steps together with the $(z, \hat{z})$ notation over
the extended set $\mathcal{Y} \cup \tilde{\mathcal{Y}}$, we expressed the dual problem as:

$$\begin{aligned} d^* = \max_{\nu, \mu} \; & \sum_{z \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} Q_Y(z) \nu(z) + \sum_{\hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} Q_{\tilde{Y}}(\hat{z}) \mu(\hat{z}) : \\ & \nu(z) + \mu(\hat{z}) \leq \|z - \hat{z}\| \quad \forall z, \hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}} \end{aligned} \tag{3.201}$$

where $\nu(z)$ and $\mu(\hat{z})$ denote Lagrange multipliers w.r.t. the
marginal-distribution-associated equality constraints, one for $Q_Y$ and the other for $Q_{\tilde{Y}}$.


At the end of the last section, we claimed that the above problem is equivalent
to the following simpler optimization problem containing only one function $\psi(\cdot)$:

$$\begin{aligned} d^{**} = \max_{\psi} \; & \sum_{z \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} Q_Y(z) \psi(z) - \sum_{\hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}}} Q_{\tilde{Y}}(\hat{z}) \psi(\hat{z}) : \\ & |\psi(z) - \psi(\hat{z})| \leq \|z - \hat{z}\| \quad \forall z, \hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}}. \end{aligned} \tag{3.202}$$

Also we mentioned that this simpler optimization leads to the WGAN
optimization.


Outline In this section, we will prove the above claim and then derive the WGAN
optimization accordingly. What we are going to cover is threefold. We will first
prove $d^* = d^{**}$. We will then use the claim to derive an optimization for WGAN.
Finally we will discuss the optimality of the Wasserstein distance for
divergence-measure-based optimization problems.


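Proof of $d^* \geq d^{**}$ We will show the following two inequalities: $d^{**} \leq d^*$ and
$d^* \leq d^{**}$. First let us prove the former. Consider:

$$\begin{aligned} d^{**} &:= \max_{\psi} \sum_{z} Q_Y(z) \psi(z) - \sum_{\hat{z}} Q_{\tilde{Y}}(\hat{z}) \psi(\hat{z}) \\ &\leq \max_{\psi} \max_{\mu} \sum_{z} Q_Y(z) \psi(z) + \sum_{\hat{z}} Q_{\tilde{Y}}(\hat{z}) \mu(\hat{z}) \\ &= \max_{\nu} \max_{\mu} \sum_{z} Q_Y(z) \nu(z) + \sum_{\hat{z}} Q_{\tilde{Y}}(\hat{z}) \mu(\hat{z}) \end{aligned} \tag{3.203}$$

where all sums run over $\mathcal{Y} \cup \tilde{\mathcal{Y}}$; the inequality follows from the fact that $-\psi(\hat{z})$ (next to $Q_{\tilde{Y}}(\hat{z})$ in the first
line) can be interpreted as a particular choice among general functions
represented by $\mu(\cdot)$; and the last equality comes from a notational change: $\psi(\cdot) \to \nu(\cdot)$.

On the other hand, with the notational changes w.r.t. functions ($-\psi(\hat{z}) \to \mu(\hat{z})$
and $\psi(z) \to \nu(z)$), one can represent the constraint in (3.202) as:

$$|\nu(z) + \mu(\hat{z})| \leq \|z - \hat{z}\| \quad \forall z, \hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}}. \tag{3.204}$$

Since the RHS in the above is non-negative, the constraint implies that:

$$\nu(z) + \mu(\hat{z}) \leq \|z - \hat{z}\| \quad \forall z, \hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}}. \tag{3.205}$$

This constraint (3.205) yields a search space at least as large as that of (3.204),
since (3.204) implies (3.205) while the converse need not hold. Since the above optimization is
about maximization, the larger search space gives:

$$\begin{aligned} d^{**} \leq \max_{\nu} \max_{\mu} \; & \sum_{z} Q_Y(z) \nu(z) + \sum_{\hat{z}} Q_{\tilde{Y}}(\hat{z}) \mu(\hat{z}) : \\ & \nu(z) + \mu(\hat{z}) \leq \|z - \hat{z}\| \quad \forall z, \hat{z} \in \mathcal{Y} \cup \tilde{\mathcal{Y}}. \end{aligned} \tag{3.206}$$

Notice that the RHS in the above is exactly the same as the original dual
problem (3.201) with the optimal value $d^*$. Hence, this proves:

$$d^{**} \leq d^*. \tag{3.207}$$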
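Proof of $d^* \le d^{**}$ Let us start by recalling the original dual problem (3.201):

$$
d^* = \max_{\nu,\mu} \sum_{z \in \mathcal{Y}\cup\hat{\mathcal{Y}}} Q_Y(z)\nu(z) + \sum_{\hat z \in \mathcal{Y}\cup\hat{\mathcal{Y}}} Q_{\hat Y}(\hat z)\mu(\hat z): \quad \nu(z) + \mu(\hat z) \le \|z - \hat z\| \quad \forall z, \hat z \in \mathcal{Y}\cup\hat{\mathcal{Y}}. \tag{3.208}
$$

By massaging the constraint in the above dual problem, we can relate it to the constraint in the translated problem (3.202). To see this, let us first move the $\nu(z)$ term in the constraint to the RHS. This yields:

$$\mu(\hat z) \le -\nu(z) + \|z - \hat z\|. \tag{3.209}$$

Since this holds for all $z, \hat z \in \mathcal{Y}\cup\hat{\mathcal{Y}}$, we obtain:

$$\mu(\hat z) \le \min_{z \in \mathcal{Y}\cup\hat{\mathcal{Y}}} \{-\nu(z) + \|z - \hat z\|\}. \tag{3.210}$$

Next we define the RHS in the above as $-\psi(\hat z)$:

$$-\psi(\hat z) := \min_{z \in \mathcal{Y}\cup\hat{\mathcal{Y}}} \{-\nu(z) + \|z - \hat z\|\}. \tag{3.211}$$

This definition then yields:

$$\mu(\hat z) \le -\psi(\hat z) \quad \forall \hat z \in \mathcal{Y}\cup\hat{\mathcal{Y}}. \tag{3.212}$$

On the other hand, in (3.211), consider the particular choice $z = \hat z$ inside the minimization. Since the minimum cannot exceed the value at any particular choice, this gives:

$$-\psi(\hat z) \le -\nu(\hat z) + \|\hat z - \hat z\| = -\nu(\hat z) \quad \forall \hat z \in \mathcal{Y}\cup\hat{\mathcal{Y}}. \tag{3.213}$$

With a notational change in the above from $\hat z$ to $z$, we get:

$$\nu(z) \le \psi(z) \quad \forall z \in \mathcal{Y}\cup\hat{\mathcal{Y}}. \tag{3.214}$$

Applying the derived inequalities (3.214) and (3.212) to the original dual problem (3.208), and noting that $Q_Y$ and $Q_{\hat Y}$ are non-negative, we get:

$$d^* \le \max_{\psi} \sum_{z \in \mathcal{Y}\cup\hat{\mathcal{Y}}} Q_Y(z)\psi(z) - \sum_{\hat z \in \mathcal{Y}\cup\hat{\mathcal{Y}}} Q_{\hat Y}(\hat z)\psi(\hat z). \tag{3.215}$$

This coincides with the objective function of the other optimization of interest, (3.202).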
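The next step is to figure out how the constraint $|\psi(z) - \psi(\hat z)| \le \|z - \hat z\|$ in the translated optimization (3.202) comes up. It turns out that the definition (3.211) incurs the constraint: $|\psi(z) - \psi(\hat z)| \le \|z - \hat z\|$, $\forall z, \hat z \in \mathcal{Y}\cup\hat{\mathcal{Y}}$. To see this clearly, consider:

$$
\begin{aligned}
\psi(z) &= \max_{t \in \mathcal{Y}\cup\hat{\mathcal{Y}}} \{\nu(t) - \|t - z\|\} \quad \forall z \in \mathcal{Y}\cup\hat{\mathcal{Y}}; \\
-\psi(\hat z) &= \min_{t' \in \mathcal{Y}\cup\hat{\mathcal{Y}}} \{-\nu(t') + \|t' - \hat z\|\} \quad \forall \hat z \in \mathcal{Y}\cup\hat{\mathcal{Y}}.
\end{aligned}
\tag{3.216}
$$

This comes simply from the definition (3.211). Adding the above two, we get: $\forall z, \hat z \in \mathcal{Y}\cup\hat{\mathcal{Y}}$,

$$
\begin{aligned}
\psi(z) - \psi(\hat z) &= \max_{t \in \mathcal{Y}\cup\hat{\mathcal{Y}}} \{\nu(t) - \|t - z\|\} + \min_{t' \in \mathcal{Y}\cup\hat{\mathcal{Y}}} \{-\nu(t') + \|t' - \hat z\|\} \\
&\overset{(a)}{\le} \max_{t \in \mathcal{Y}\cup\hat{\mathcal{Y}}} \{\nu(t) - \|t - z\| - \nu(t) + \|t - \hat z\|\} \\
&\overset{(b)}{\le} \|z - \hat z\|
\end{aligned}
\tag{3.217}
$$

where (a) follows from choosing $t' = t$ in the minimization part (the minimum is no larger than the value at any particular choice); and (b) comes from the triangle inequality:

$$\|t - \hat z\| \le \|t - z\| + \|z - \hat z\|.$$

Swapping the roles of $z$ and $\hat z$ in (3.217), one can also get: $\forall z, \hat z \in \mathcal{Y}\cup\hat{\mathcal{Y}}$,

$$\psi(\hat z) - \psi(z) \le \|z - \hat z\|. \tag{3.218}$$

This together with (3.217) then yields:

$$|\psi(z) - \psi(\hat z)| \le \|z - \hat z\| \quad \forall z, \hat z \in \mathcal{Y}\cup\hat{\mathcal{Y}}. \tag{3.219}$$

The implied constraint (3.219) yields a larger search space than the definition (3.211), since every $\psi$ defined via (3.211) satisfies (3.219). Applying this to (3.215), we get:

$$
d^* \le \max_{\psi} \sum_{z \in \mathcal{Y}\cup\hat{\mathcal{Y}}} Q_Y(z)\psi(z) - \sum_{\hat z \in \mathcal{Y}\cup\hat{\mathcal{Y}}} Q_{\hat Y}(\hat z)\psi(\hat z): \quad |\psi(z) - \psi(\hat z)| \le \|z - \hat z\| \quad \forall z, \hat z \in \mathcal{Y}\cup\hat{\mathcal{Y}}. \tag{3.220}
$$

The RHS is exactly $d^{**}$. This completes the proof of $d^* \le d^{**}$.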
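The two properties used in the proof, namely that the function built from (3.211) dominates $\nu$ and is 1-Lipschitz, can be checked numerically. The following is only a sketch: the ten random points standing in for $\mathcal{Y}\cup\hat{\mathcal{Y}}$ and the random values of $\nu$ are hypothetical, chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite set of points playing the role of Y ∪ Ŷ (10 points in R^2).
points = rng.normal(size=(10, 2))
# An arbitrary function nu(.) on those points, stored as a vector of values.
nu = rng.normal(size=10)

# Pairwise Euclidean distances: dist[t, z] = ||t - z||.
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# psi(z) = max_t { nu(t) - ||t - z|| }, which is equivalent to definition (3.211).
psi = np.max(nu[:, None] - dist, axis=0)

# (3.214): nu(z) <= psi(z) for every z (small tolerance for rounding).
dominates = bool(np.all(nu <= psi + 1e-12))

# (3.219): |psi(z) - psi(z_hat)| <= ||z - z_hat|| for every pair.
gap = np.abs(psi[:, None] - psi[None, :])
is_one_lipschitz = bool(np.all(gap <= dist + 1e-12))

print(dominates, is_one_lipschitz)
```

Both checks hold for any choice of points and of $\nu$, which is what the derivation (3.213) to (3.219) asserts.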


1-Lipschitz constraint Notice the constraint in (3.202). This is a well-known constraint in mathematics and statistics, called the 1-Lipschitz constraint. It comes from the definition of a 1-Lipschitz function. We say that a function $f(\cdot)$ is 1-Lip if

$$|f(x_1) - f(x_2)| \le \|x_1 - x_2\| \quad \forall x_1, x_2. \tag{3.221}$$

The intuition behind a 1-Lip function is that the function values evaluated at $x_1$ and $x_2$ do not differ much as long as the distance $\|x_1 - x_2\|$ between the two points is small, meaning that the function changes smoothly, in a loose sense, over $x$.
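To make the definition (3.221) concrete, here is a small sketch that estimates the largest slope of a scalar function over a grid of sample points. The helper name `empirical_lipschitz` and the two test functions are illustrative assumptions; a finite grid can only give a lower estimate of the true Lipschitz constant.

```python
import numpy as np

def empirical_lipschitz(f, xs):
    """Largest slope |f(x1) - f(x2)| / |x1 - x2| over all distinct sample pairs."""
    values = f(xs)
    diff_f = np.abs(values[:, None] - values[None, :])
    diff_x = np.abs(xs[:, None] - xs[None, :])
    mask = diff_x > 0  # skip the pairs with x1 == x2
    return float(np.max(diff_f[mask] / diff_x[mask]))

xs = np.linspace(-3.0, 3.0, 201)
lip_sin = empirical_lipschitz(np.sin, xs)                  # sin is 1-Lip since |cos| <= 1
lip_double = empirical_lipschitz(lambda x: 2.0 * x, xs)    # slope 2 violates (3.221)
print(lip_sin <= 1.0, lip_double > 1.0)
```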

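Using this definition, one can say that the function $\psi(\cdot)$ in (3.202) is 1-Lip. Applying this to (3.202), we obtain a simpler expression for the optimization problem:

$$\max_{\psi(\cdot):\,\text{1-Lip}} \sum_{z \in \mathcal{Y}\cup\hat{\mathcal{Y}}} Q_Y(z)\psi(z) - \sum_{\hat z \in \mathcal{Y}\cup\hat{\mathcal{Y}}} Q_{\hat Y}(\hat z)\psi(\hat z). \tag{3.222}$$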
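As a sanity check of (3.222), consider scalar samples, where the 1st-order Wasserstein distance between two equal-size empirical distributions is obtained by matching sorted samples. The sketch below assumes hypothetical one-dimensional data in which the fake samples are the real ones shifted by 2; it shows that any 1-Lip $\psi$ gives a value no larger than the distance, and that $\psi(z) = z$ attains it in this particular example.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 500
y_real = rng.normal(loc=2.0, scale=1.0, size=m)   # hypothetical real samples
y_fake = y_real - 2.0                              # fake samples: real ones shifted by 2

# For equal-size 1-D samples, the optimal coupling matches sorted samples.
w1 = float(np.mean(np.abs(np.sort(y_real) - np.sort(y_fake))))

def dual_objective(psi):
    """Objective of (3.222) with Q_Y and Q_Yhat uniform over the m samples."""
    return float(np.mean(psi(y_real)) - np.mean(psi(y_fake)))

obj_identity = dual_objective(lambda z: z)   # psi(z) = z is 1-Lip
obj_sine = dual_objective(np.sin)            # psi(z) = sin(z) is also 1-Lip
print(w1, obj_identity, obj_sine)
```

In higher dimensions no such closed form is available, which is one reason the maximization over 1-Lip functions is approximated with a neural network.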


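Wasserstein GAN Recall the definitions of $Q_Y$ and $Q_{\hat Y}$:

$$
Q_Y(z) = \begin{cases} \frac{1}{m}, & \text{if } z \in \mathcal{Y}; \\ 0, & \text{if } z \in \hat{\mathcal{Y}} \setminus \mathcal{Y}; \end{cases}
\qquad
Q_{\hat Y}(\hat z) = \begin{cases} \frac{1}{m}, & \text{if } \hat z \in \hat{\mathcal{Y}}; \\ 0, & \text{if } \hat z \in \mathcal{Y} \setminus \hat{\mathcal{Y}}. \end{cases}
$$

Also, in the original optimization (3.200), we have the outer minimization over $G(\cdot)$. Taking all of these into consideration, we can translate the original Wasserstein-distance-based optimization into:

$$\min_{G(\cdot)} \max_{\psi(\cdot):\,\text{1-Lip}} \frac{1}{m}\sum_{i=1}^{m} \psi(y^{(i)}) - \frac{1}{m}\sum_{i=1}^{m} \psi(\hat y^{(i)}). \tag{3.223}$$

This is a function optimization problem. Hence, as Goodfellow did, Bottou employed neural networks for $G(\cdot)$ and $\psi(\cdot)$ to approximate the optimization problem (3.223) as:

$$\min_{G(\cdot) \in \mathcal{N}} \max_{\psi(\cdot) \in \mathcal{N}:\,\text{1-Lip}} \frac{1}{m}\sum_{i=1}^{m} \psi(y^{(i)}) - \frac{1}{m}\sum_{i=1}^{m} \psi(\hat y^{(i)}) \tag{3.224}$$

where $\mathcal{N}$ indicates a set of DNN-based functions. This is exactly the optimization problem for WGAN. It turns out that WGAN works well in practice. As of now, it is the state of the art: many GAN variants that work best for some applications are based on WGAN.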


A fundamental question Recall the generic divergence-based optimization problem:

$$\min_{G(\cdot)} D(Q_Y, Q_{\hat Y}).$$

The WGAN optimization (3.200) is an instance of this generic problem. Then, one very natural question arises. Is the Wasserstein distance the best choice for $D(\cdot, \cdot)$? In other words,

$$D^*(\cdot, \cdot) = W(\cdot, \cdot)? \tag{3.225}$$


A special case In an effort to address this question, a few groups including our group have investigated a special setting in which the optimal generator $G^*$ is known. The special setting is the so-called Gaussian linear-generator setting, wherein the data $\{y^{(i)}\}_{i=1}^{m}$ follow a Gaussian distribution, say:

$$y^{(i)} \sim \mathcal{N}(0, K_Y) \quad \text{where } K_Y = U \Lambda U^T, \tag{3.226}$$

and the generator is subject to a linear operation:

$$\hat y^{(i)} = G(x^{(i)}) = G x^{(i)} \quad \text{where } G \in \mathbb{R}^{n \times k}. \tag{3.227}$$

Here one natural assumption that one can make on the distribution of $x^{(i)}$ is:

$$x^{(i)} \sim \mathcal{N}(0, I), \tag{3.228}$$

as this makes the fake samples Gaussian as well, matching the type of distribution of the real samples:

$$\hat y^{(i)} \sim \mathcal{N}\left(0, \mathbb{E}\left[(G x^{(i)})(G x^{(i)})^T\right]\right) = \mathcal{N}(0, G G^T). \tag{3.229}$$

Fortunately, under the above Gaussian setting, the optimal $G^*$ is well known. Here optimality is meant in the sense of maximizing the likelihood of the data, the same criterion that we adopted while discussing the optimality of cross entropy loss in Section 3.2. It turns out that the optimal $G^*$ is the one that performs Principal Component Analysis (PCA):

$$\mathbb{E}\left[\hat y^{(i)} \hat y^{(i)T}\right] = G^* G^{*T} = U \,\text{diag}(\lambda_1, \dots, \lambda_k, 0, \dots, 0)\, U^T \tag{3.230}$$


where $(\lambda_1, \dots, \lambda_k)$ denote the $k$ principal (largest) eigenvalues⁸ of $K_Y = U \Lambda U^T$. In a usual setting in which $k \le n$, the PCA solution makes sense. The rank of $G G^T$ is limited by $k$, so it may not fully represent $K_Y$, whose rank can be $n$. In this case, the best one can do is to make $G G^T$ as close as possible to $K_Y$. One natural way is to take the $k$ largest eigenvalues of $K_Y$ to form a covariance matrix. It turns out that this is the best way in the sense of maximizing the likelihood of the data. The proof of this will be explored in Prob 10.4.

8. One may ask how to compute the principal eigenvalues when $K_Y$ is unknown, which is often the case in reality. In this case, the optimal way is to compute an empirical covariance matrix $S := \frac{1}{m}\sum_{i=1}^{m} y^{(i)} y^{(i)T}$ and then to take the $k$ largest eigenvalues of $S$.
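The PCA solution (3.230), with the empirical covariance matrix in place of $K_Y$, can be sketched in a few lines of NumPy. The dimensions, the eigenvalues and the sample size below are hypothetical values picked for illustration; note also that $G^*$ is not unique, since multiplying it on the right by any orthogonal $k \times k$ matrix leaves $G^* G^{*T}$ unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 5, 2, 20000   # data dimension, latent dimension, sample count (illustrative)

# Hypothetical ground-truth covariance K_Y = U diag(lam) U^T with full rank n.
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
lam = np.array([5.0, 3.0, 1.0, 0.5, 0.1])
K_Y = U @ np.diag(lam) @ U.T

# Empirical covariance S = (1/m) sum_i y_i y_i^T from zero-mean Gaussian samples.
Y = rng.multivariate_normal(np.zeros(n), K_Y, size=m)
S = Y.T @ Y / m

# Keep the k largest eigenpairs of S (np.linalg.eigh sorts eigenvalues ascending).
eigvals, eigvecs = np.linalg.eigh(S)
top_vals = eigvals[::-1][:k]
top_vecs = eigvecs[:, ::-1][:, :k]

# One valid choice of G with G G^T = U_k diag(top_vals) U_k^T.
G_star = top_vecs * np.sqrt(top_vals)
K_hat = G_star @ G_star.T

print(np.linalg.matrix_rank(K_hat), np.round(top_vals, 2))
```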

Now under the special Gaussian linear-generator setting, one can restate the fundamental question (3.225) as whether the following holds:

$$G^* = G^*_{\text{WGAN}} := \arg\min_{G \in \mathbb{R}^{n \times k}} W(Q_Y, Q_{\hat Y}). \tag{3.231}$$

It turns out the answer is yes. The proof is not short, so in the interest of other topics we will not prove it here; if you are interested, you can consult the paper (Cho and Suh, 2019).

However, the answer holds under the special Gaussian setting. So you may wonder whether that is also the case under general settings in which the data distribution is arbitrary. Unfortunately, this question remains unanswered. We believe this is one of the fundamental and intriguing questions in the context of the GAN-based framework. One may believe that the answer depends on the distribution of the data under consideration. This may be the case, but even this has not been answered. So any progress on this would be of interest.


Look ahead In the past two sections, we derived the WGAN optimization. In 
the next section, we will investigate how to implement the optimization problem 
via TensorFlow. 
3.11 Wasserstein GAN: TensorFlow Implementation


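Recap In the prior section, we employed Kantorovich duality to translate the Wasserstein-distance-based optimization into:

$$\min_{G(\cdot)} \max_{\psi(\cdot):\,\text{1-Lip}} \frac{1}{m}\sum_{i=1}^{m} \psi(y^{(i)}) - \frac{1}{m}\sum_{i=1}^{m} \psi(\hat y^{(i)}). \tag{3.232}$$

Then, by parameterizing $G(\cdot)$ and $\psi(\cdot)$, we derived the WGAN optimization:

$$\min_{G(\cdot) \in \mathcal{N}} \max_{\psi(\cdot) \in \mathcal{N}:\,\text{1-Lip}} \frac{1}{m}\sum_{i=1}^{m} \psi(y^{(i)}) - \frac{1}{m}\sum_{i=1}^{m} \psi(\hat y^{(i)}) \tag{3.233}$$

where $\mathcal{N}$ indicates a set of DNN-based functions. We also discussed the optimality of WGAN under a simple Gaussian linear-generator setting.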


Outline In this section, we will study how to solve the optimization (3.233) as well as how to implement it via TensorFlow. Specifically we are going to cover three topics. First we will investigate a practical method to respect the 1-Lipschitz constraint that appears in the design of $\psi(\cdot)$ in (3.233). Next, we will explore implementation details that the WGAN paper (Arjovsky et al., 2017) introduced, regarding optimizers, neural network architecture and activation functions. Lastly, we will study TensorFlow implementation in the context of the same task considered in the GAN implementation: MNIST-style handwritten digit image generation (see Fig. 3.40).


Figure 3.40. MNIST-style handwritten digit image generation. (Diagram labels: a random signal, Generator, MNIST-like fake image, MNIST dataset.)

A practical method for ensuring the 1-Lipschitz constraint Under the neural network architecture, it is difficult to fully respect the 1-Lipschitz constraint. Hence, the WGAN paper came up with a heuristic for satisfying the constraint. The heuristic is based on the following observation: a small range of model parameters yields a small variation of the neural network function, thus encouraging 1-Lipschitz continuity. So the method confines the values of the parameters to a small range, say $[-c, c]$ where $c$ is a certain positive value. For instance, $c$ was set to 0.01 in the WGAN paper. This method is called weight clipping and its operation reads:

$$
w = \begin{cases} -c, & \text{if } w < -c; \\ w, & \text{if } w \in [-c, c]; \\ c, & \text{if } w > c. \end{cases}
\tag{3.234}
$$

We will employ this in our implementation.
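The clipping rule (3.234) acts on each weight independently. A minimal NumPy sketch, with a hypothetical weight vector and a helper name `clip_weights` chosen for illustration, looks as follows:

```python
import numpy as np

def clip_weights(w, c=0.01):
    """Apply (3.234) elementwise: values outside [-c, c] are moved to the nearest bound."""
    return np.clip(w, -c, c)

w = np.array([-0.5, -0.01, -0.003, 0.0, 0.007, 0.02])
w_clipped = clip_weights(w)
print(w_clipped)
```

Weights already inside $[-c, c]$ are left untouched, so clipping only affects the entries that a gradient step has pushed outside the range.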


RMSprop optimizer (Hinton et al., 2012) Recall the Adam optimizer that we learned in Section 3.5. The weights $w^{(t)}$ therein are updated as per:

$$w^{(t+1)} = w^{(t)} - \alpha \frac{m^{(t)}}{\sqrt{s^{(t)} + \epsilon}} \tag{3.235}$$

where $m^{(t)}$ indicates a weighted average of the current and past gradients and $s^{(t)}$ is a normalization factor that makes the effect of the gradient $\nabla J(w^{(t)})$ almost constant over $t$. The WGAN paper employs a simpler version of Adam that takes care only of the normalization factor $s^{(t)}$ while ignoring the momentum $m^{(t)}$. The simpler optimizer is also a famous one, named RMSprop, and it had been widely used until Adam came around. What the WGAN paper found is that RMSprop is enough to achieve good performance. So in our implementation, we will use this simpler version. Details on the weight update in RMSprop are given below:


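$$w^{(t+1)} = w^{(t)} - \alpha \frac{\nabla J(w^{(t)})}{\sqrt{s^{(t)} + \epsilon}}. \tag{3.236}$$

Here the normalization factor $s^{(t)}$ is updated according to:

$$s^{(t)} = \beta s^{(t-1)} + (1 - \beta)\left(\nabla J(w^{(t)})\right)^2 \tag{3.237}$$

where the square and the division are taken elementwise, and $\beta \in [0, 1]$ denotes a hyperparameter that captures the weight of past values, typically set to 0.9. In our implementation, we will employ the same hyperparameters as in the WGAN paper: the learning rate $\alpha = 0.00005$; and $\beta = 0.9$.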
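The RMSprop recursion can be sketched directly in NumPy. This is an illustration only: the toy objective $J(w) = \frac{1}{2}\|w\|^2$, the value of $\epsilon$ and the larger learning rate used in the loop are assumptions made so that the effect is visible within a few hundred steps.

```python
import numpy as np

def rmsprop_step(w, grad, s, alpha=0.00005, beta=0.9, eps=1e-7):
    """One RMSprop update; all operations are elementwise."""
    s = beta * s + (1.0 - beta) * grad ** 2          # running average of squared gradients
    w = w - alpha * grad / np.sqrt(s + eps)          # normalized gradient step
    return w, s

# Toy objective J(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
s = np.zeros_like(w)
for _ in range(1000):
    w, s = rmsprop_step(w, grad=w, s=s, alpha=0.001)

print(w)
```

Because the gradient is divided by the square root of its running mean square, both coordinates move by roughly `alpha` per step regardless of the gradient magnitude, which is the normalization effect described above.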


Leaky ReLU We mentioned several times that the default choice for activation functions at hidden layers is ReLU. There are in fact many ReLU variants used in practice. One such variant is "leaky ReLU" and it is the one that the WGAN paper employed. The operation is very similar to that of ReLU. The only distinction is that the output is a scaled version of the input for negative values:

$$
a = \begin{cases} z, & \text{if } z \ge 0; \\ a_{\text{slope}}\, z, & \text{if } z < 0 \end{cases}
\tag{3.238}
$$

where $a_{\text{slope}}$ is a hyperparameter indicating a small slope applied for a negative input. Notice that $a_{\text{slope}} = 0$ gives ReLU while $a_{\text{slope}} = 1$ yields linear activation. Hence, one can view this as a generalized version of ReLU. In our implementation, we will employ $a_{\text{slope}} = 0.2$.
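The activation (3.238) and its two special cases can be checked with a short NumPy sketch; the function name `leaky_relu` and the sample inputs are illustrative choices.

```python
import numpy as np

def leaky_relu(z, a_slope=0.2):
    """Identity for z >= 0, scaled by a_slope for z < 0, as in (3.238)."""
    return np.where(z >= 0, z, a_slope * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
out_leaky = leaky_relu(z)                  # slope 0.2 on the negative side
out_relu = leaky_relu(z, a_slope=0.0)      # reduces to ReLU
out_linear = leaky_relu(z, a_slope=1.0)    # reduces to linear activation
print(out_leaky, out_relu, out_linear)
```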


Convolutional neural networks (CNNs) The WGAN paper uses a certain type of DNNs specialized for image data: convolutional neural networks, CNNs for short. So we will also utilize them in our implementation. Since there is a lot to study about CNNs, here we briefly explain some of their key features and then present the architectures of the CNN-based models to be employed in our implementation.

CNNs consist of two building blocks. The first is the conv layer. The key feature of the conv layer is that it is sparsely connected with input values. The sparse-connectivity feature comes from an interesting scientific finding w.r.t. visual neurons of intelligent beings: visual neurons react only to a limited region of an image (not the entire region). The second building block is the pooling layer. The role of pooling is not inspired by how visual neurons work. Rather it is mainly for implementation. The role is to downsample signals in an effort to reduce the computational load and the memory size.


Model for generator As a generator model, we employ a 5-layer CNN with four hidden layers, as depicted in Fig. 3.41. For activation at each hidden layer, we utilize leaky ReLU (marked in light blue). As in the GAN implementation (in Section 3.8), we also use Batch Normalization (BN), marked in green, prior to activation at the second and third hidden layers. Since an MNIST image consists of 28-by-28 pixels, the output layer has 28-by-28 neurons spread in the 2-dimensional space, and we use tanh activation (a shifted and scaled version of logistic activation) to ensure the range of $[-1, 1]$.

As the first hidden layer, we employ a fully-connected dense layer with $6272 \,(= 7 \times 7 \times 128)$ neurons. It is then reshaped into a 3D tensor with a size of $7 \times 7 \times 128$. Next, we upsample it to have an expanded tensor of size $14 \times 14 \times 128$.⁹ Again we upsample it to yield another expanded tensor of size $28 \times 28 \times 128$. Lastly we have a conv layer to output 28-by-28 sized 2D neurons.

9. Explanation of its detailed operations is omitted here. We only present its role and will leave the TensorFlow implementation to the sequel.
Figure 3.41. Generator: A 5-layer CNN where the input size (the dimension of a latent signal) is 50; we use a 6272-sized dense layer for the first hidden layer; and we utilize conv layers for the remaining layers. The role of the second and third hidden layers is to upsample the input to yield an expanded 3D tensor (e.g., from $7 \times 7 \times 128$ to $14 \times 14 \times 128$ in the second hidden layer). We employ leaky ReLU activation for every hidden layer, and tanh activation for the output layer to ensure output signals between $-1$ and $+1$. (Diagram labels: tensor sizes $7 \times 7 \times 128$, $14 \times 14 \times 128$, $28 \times 28 \times 128$ and $28 \times 28$; BN; Leaky ReLU; tanh.)

Figure 3.42. Critic: A 4-layer CNN where the input size (the dimension of a real or fake image) is $28 \times 28$; we use conv layers for the first and second hidden layers; and we utilize a dense layer in the last layer. The role of the conv layers here is to downsample the input, unlike in the generator. We employ leaky ReLU activation for every hidden layer, and linear activation for the output layer. (Diagram labels: tensor sizes $28 \times 28$, $14 \times 14 \times 64$ and $7 \times 7 \times 64$.)


Model for critic Instead of using a discriminator to classify generated images as being real or fake, WGAN replaces the discriminator with a critic that scores the realness or fakeness of a given image. So we call it a critic here. As a model for the critic, we use a 4-layer CNN with three hidden layers, illustrated in Fig. 3.42. Here the input size must be the same as that of a real (or fake) image, so it should read $28 \times 28$. In the first hidden layer, unlike the generator operation, we downsample the input to yield a reduced map of size $14 \times 14$. We generate 64 different maps independently. We then stack all of them to construct a 3D tensor of size $14 \times 14 \times 64$. In the next layer, we perform a similar operation to generate another 3D tensor of size $7 \times 7 \times 64$. It is then flattened to form a vector of size $3136 \,(= 7 \times 7 \times 64)$. Lastly, we have a fully-connected layer to output a single neuron with linear activation (no activation).
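The tensor sizes quoted for the generator and the critic can be traced with simple arithmetic. The sketch below assumes stride-2 layers with 'same' padding, for which a strided convolution maps a spatial size $s$ to $\lceil s/\text{stride} \rceil$ and a transposed convolution maps it to $s \times \text{stride}$; the helper names are hypothetical.

```python
import math

def conv_same(size, stride):
    """Spatial size after a strided convolution with 'same' padding."""
    return math.ceil(size / stride)

def conv_transpose_same(size, stride):
    """Spatial size after a strided transposed convolution with 'same' padding."""
    return size * stride

# Generator: 7x7x128 -> 14x14x128 -> 28x28x128 -> 28x28 (stride-1 output conv).
gen_sizes = [7]
for stride in (2, 2, 1):
    layer = conv_transpose_same if stride > 1 else conv_same
    gen_sizes.append(layer(gen_sizes[-1], stride))

# Critic: 28x28 -> 14x14x64 -> 7x7x64 -> flatten.
critic_sizes = [28]
for stride in (2, 2):
    critic_sizes.append(conv_same(critic_sizes[-1], stride))
flattened = critic_sizes[-1] ** 2 * 64

print(gen_sizes, critic_sizes, flattened, 7 * 7 * 128)
```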


TensorFlow: Model for generator TensorFlow implementation for the generator model described in Fig. 3.41 is given below.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Reshape
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import Conv2DTranspose
from tensorflow.keras.layers import LeakyReLU
from tensorflow.keras.initializers import RandomNormal

# dimension of the latent signal
latent_dim = 50
# weight initialization
init = RandomNormal(stddev=0.02)

generator = Sequential()
# first hidden layer: dense with 6272 (= 7*7*128) neurons
generator.add(Dense(128*7*7, input_dim=latent_dim))
generator.add(LeakyReLU(0.2))
generator.add(Reshape((7, 7, 128)))
# upsample to 14*14
generator.add(Conv2DTranspose(128, (4, 4), strides=(2, 2),
                              padding='same', kernel_initializer=init))
generator.add(BatchNormalization())
generator.add(LeakyReLU(0.2))
# upsample to 28*28
generator.add(Conv2DTranspose(128, (4, 4), strides=(2, 2),
                              padding='same', kernel_initializer=init))
generator.add(BatchNormalization())
generator.add(LeakyReLU(0.2))
# output 28*28*1
generator.add(Conv2D(1, (7, 7), activation='tanh',
                     padding='same', kernel_initializer=init))

Here we use the built-in class Conv2DTranspose for the purpose of upsampling. It takes several input arguments, including strides, padding and kernel_initializer. These are all hyperparameters subject to our design choice. We omit the details in order not to distract you. Remember that these are just particular choices.


TensorFlow: Model for critic Unlike the generator, the critic model intends to respect the 1-Lipschitz constraint. As mentioned earlier, to this end, we employ weight clipping. Specifically we rely upon the following class for code implementation:

from tensorflow.keras import backend
from tensorflow.keras.constraints import Constraint

# clip model weights to a given hypercube
class ClipConstraint(Constraint):
    # set clip value when initialized
    def __init__(self, clip_value):
        self.clip_value = clip_value
    # clip model weights to hypercube
    def __call__(self, weights):
        return backend.clip(weights,
                            -self.clip_value, self.clip_value)
    # return the configuration of the constraint
    def get_config(self):
        return {'clip_value': self.clip_value}

We use a built-in function backend.clip to implement weight clipping (3.234). This class can then be employed to apply weight clipping in the design of a critic model. See below for TensorFlow implementation.


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import LeakyReLU
from tensorflow.keras.initializers import RandomNormal

in_shape = (28, 28, 1)
# clip value
c = 0.01
# weight initialization
init = RandomNormal(stddev=0.02)
# weight constraint
const = ClipConstraint(c)

critic = Sequential()
# downsample to 14*14
critic.add(Conv2D(64, (4, 4), strides=(2, 2), padding='same',
                  kernel_initializer=init,
                  kernel_constraint=const,
                  input_shape=in_shape))
critic.add(BatchNormalization())
critic.add(LeakyReLU(0.2))
# downsample to 7*7 (input_shape is needed only in the first layer)
critic.add(Conv2D(64, (4, 4), strides=(2, 2), padding='same',
                  kernel_initializer=init,
                  kernel_constraint=const))
critic.add(BatchNormalization())
critic.add(LeakyReLU(0.2))
# scoring, linear activation
critic.add(Flatten())
critic.add(Dense(1))

Here we use c = 0.01 for the clipping value as in the WGAN paper, and ClipConstraint(c) is passed as the kernel_constraint argument of Conv2D to apply weight clipping.

TensorFlow: Generator input As a generator input, we use a random signal with the Gaussian distribution. In particular, we use:

$$x \in \mathbb{R}^{\text{latent\_dim}} \sim \mathcal{N}(0, I_{\text{latent\_dim}}).$$

Here is how to generate the Gaussian random signal in TensorFlow:

from tensorflow.random import normal
x = normal([batch_size, latent_dim])


TensorFlow: Optimizers for generator & critic We use RMSprop optimizers with the learning rate 0.00005 and the default choice of $\beta = 0.9$ (the argument rho in Keras). We can readily implement this via the built-in class RMSprop:

from tensorflow.keras.optimizers import RMSprop
opt = RMSprop(learning_rate=0.00005)


Now how about loss functions for the generator and the critic? To answer this, consider the batch version of the WGAN optimization (3.233):

$$\min_{w} \max_{\theta} \frac{1}{m_B}\sum_{i \in \mathcal{B}} \psi_\theta(y^{(i)}) - \frac{1}{m_B}\sum_{i \in \mathcal{B}} \psi_\theta(G_w(x^{(i)})). \tag{3.239}$$

From the critic's viewpoint, negating the objective gives the equivalent minimization problem:

$$\min_{\theta} \underbrace{-\frac{1}{m_B}\sum_{i \in \mathcal{B}} \psi_\theta(y^{(i)}) + \frac{1}{m_B}\sum_{i \in \mathcal{B}} \psi_\theta(G_w(x^{(i)}))}_{\text{critic loss}}$$

where the critic loss is defined as the negated objective. Here is how to implement the critic loss:

from tensorflow.keras import backend
def wasserstein_loss(y_true, y_pred):
    return backend.mean(y_true*y_pred)

For a real image, we can set y_true = -1 and take its corresponding critic output for y_pred. On the other hand, for a fake image, y_true = +1 and y_pred should read the critic output fed by the fake image.

Since $\psi_\theta(y^{(i)})$ does not depend on the generator weights $w$ in (3.239), the generator loss is:

$$\min_{w} \underbrace{-\frac{1}{m_B}\sum_{i \in \mathcal{B}} \psi_\theta(G_w(x^{(i)}))}_{\text{generator loss}}.$$

Hence, we can also implement this via the function wasserstein_loss that we defined above. Here y_true should always be set to -1 and y_pred should be the critic output fed by a fake image.
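The label convention can be verified on a small example. The sketch below is a NumPy analogue of the loss (not the Keras function itself), and the critic outputs are made-up numbers used only to show how the signs work out.

```python
import numpy as np

def wasserstein_loss(y_true, y_pred):
    """NumPy analogue of the Keras loss: mean of label times critic output."""
    return float(np.mean(y_true * y_pred))

# Hypothetical critic outputs for a batch of 4 real and 4 fake images.
out_real = np.array([0.9, 1.1, 0.7, 1.3])
out_fake = np.array([-0.2, 0.1, -0.4, 0.1])

# Critic: label -1 for real images and +1 for fake images.
c_label = np.concatenate([-np.ones(4), np.ones(4)])
c_loss = wasserstein_loss(c_label, np.concatenate([out_real, out_fake]))

# Generator: label -1 for fake images.
g_loss = wasserstein_loss(-np.ones(4), out_fake)

print(c_loss, g_loss)
```

When real and fake outputs are concatenated into one batch of size $2 m_B$, the mean equals one half of the critic loss written above; the constant factor does not change the minimizer.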

Taking all the above into consideration, we can compile the generator and the critic as below.

from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras import backend

# RMSprop optimizers (one instance per compiled model)
critic_opt = RMSprop(learning_rate=0.00005)
gan_opt = RMSprop(learning_rate=0.00005)

# define the "Wasserstein loss"
def wasserstein_loss(y_true, y_pred):
    return backend.mean(y_true*y_pred)

# critic compile
critic.compile(loss=wasserstein_loss, optimizer=critic_opt)

# define the GAN model with fake input and critic output
# fix critic's weights while training generator
critic.trainable = False
gan_input = Input(shape=(latent_dim,))
x = generator(gan_input)
output = critic(x)
gan = Model(gan_input, output)
# generator compile
gan.compile(loss=wasserstein_loss, optimizer=gan_opt)


TensorFlow: Getting batches Since we use batches of size batch_size, we provide code that splits the data into batches.

import numpy as np
def get_batches(data, batch_size):
    batches = []
    # the last incomplete batch (if any) is dropped
    for i in range(data.shape[0] // batch_size):
        batch = data[i*batch_size:(i+1)*batch_size]
        batches.append(batch)
    return np.asarray(batches)


TensorFlow: Training Putting all of the above together, we can now train the generator and the critic via alternating gradient descent:

import numpy as np
from tensorflow.random import normal

# X_train (training images) and batch_size are assumed to be defined beforehand
EPOCHS = 50
k = 2  # k:1 alternating gradient descent
c_losses = []
g_losses = []

for epoch in range(1, EPOCHS + 1):
    # train per each batch
    np.random.shuffle(X_train)
    for i, real_imgs in enumerate(get_batches(X_train, batch_size)):
        #####################
        # train critic
        #####################
        # fake input generation
        gen_input = normal([batch_size, latent_dim])
        # fake images
        gen_imgs = generator.predict(gen_input)
        real_imgs = real_imgs.reshape(-1, 28, 28, 1)
        # input for critic
        c_input = np.concatenate([real_imgs, gen_imgs])
        # label for critic
        # first half: real (-1); second half: fake (+1)
        c_label = np.ones(2*batch_size)
        c_label[:batch_size] = -1
        # train critic
        c_loss = critic.train_on_batch(c_input, c_label)

        #####################
        # train generator
        #####################
        if (i + 1) % k == 0:  # train once every k steps
            # fake input generation
            g_input = normal([batch_size, latent_dim])
            # label for fake images
            # create inverted labels for fake images
            g_label = -np.ones(batch_size)
            # train generator
            g_loss = gan.train_on_batch(g_input, g_label)

    # record the latest losses of this epoch
    c_losses.append(c_loss)
    g_losses.append(g_loss)
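The k:1 alternation means k critic updates per generator update. One way to express "once every k steps" for a general k is the test `(i + 1) % k == 0`; for k = 2 it selects the odd batch indices, exactly as the truthiness of `i % k` does, whereas for k > 2 the expression `i % k` would be truthy on k - 1 out of every k batches. The helper below is a hypothetical illustration of the schedule, not part of the training code.

```python
def generator_update_steps(num_batches, k):
    """Batch indices at which the generator is updated under a k:1 schedule."""
    return [i for i in range(num_batches) if (i + 1) % k == 0]

steps_k2 = generator_update_steps(num_batches=8, k=2)
steps_k5 = generator_update_steps(num_batches=10, k=5)
print(steps_k2, steps_k5)
```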


Look ahead So far we have investigated several applications that arise in machine learning, ranging from logistic regression, deep learning and unsupervised learning to GANs and WGAN. From the next section on, we will move on to the last application, which concerns societal issues relevant to optimization techniques: fair machine learning.
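Problem Set 10

Prob 10.1 (An issue in Goodfellow's GAN) Let

$$
Q_Y(z) = \begin{cases} \frac{1}{m}, & \text{if } z \in \mathcal{Y}; \\ 0, & \text{if } z \in \hat{\mathcal{Y}} \setminus \mathcal{Y}; \end{cases}
\qquad
Q_{\hat Y}(\hat z) = \begin{cases} \frac{1}{m}, & \text{if } \hat z \in \hat{\mathcal{Y}}; \\ 0, & \text{if } \hat z \in \mathcal{Y} \setminus \hat{\mathcal{Y}}, \end{cases}
\tag{3.240}
$$

where $\mathcal{Y} := \{y^{(1)}, \dots, y^{(m)}\}$ and $\hat{\mathcal{Y}} := \{\hat y^{(1)}, \dots, \hat y^{(m)}\}$ indicate the sets of real and fake samples, respectively.

(a) In Section 3.9, we argued that in many practical settings,

$$\mathcal{Y} \cap \hat{\mathcal{Y}} = \emptyset. \tag{3.241}$$

Explain why.

(b) Assuming (3.241), show that

$$\text{JSD}(Q_Y, Q_{\hat Y}) = \log 2. \tag{3.242}$$

(c) In Section 3.7, we showed that the original GAN by Goodfellow can be translated into the JSD-based optimization problem under a certain assumption. Explain why the original GAN works well in practice although the JSD-based optimization does not guarantee a good performance due to (3.242).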


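Prob 10.2 (Wasserstein GAN) Consider the 1st-order Wasserstein distance:

$$W(Q_Y, Q_{\hat Y}) = \min_{Q_{Y,\hat Y}} \mathbb{E}\left[\|Y - \hat Y\|\right] \tag{3.243}$$

where $Q_Y$ and $Q_{\hat Y}$ indicate the empirical distributions of real and fake samples defined as:

$$
Q_Y(z) = \begin{cases} \frac{1}{m}, & \text{if } z \in \mathcal{Y}; \\ 0, & \text{if } z \in \hat{\mathcal{Y}} \setminus \mathcal{Y}; \end{cases}
\qquad
Q_{\hat Y}(\hat z) = \begin{cases} \frac{1}{m}, & \text{if } \hat z \in \hat{\mathcal{Y}}; \\ 0, & \text{if } \hat z \in \mathcal{Y} \setminus \hat{\mathcal{Y}}. \end{cases}
\tag{3.244}
$$

Here $Q_{Y,\hat Y}$ should respect the marginal distributions $Q_Y$ and $Q_{\hat Y}$. Let $\mathcal{V} := \mathcal{Y} \cup \hat{\mathcal{Y}}$.

(a) Show that the dual problem of (3.243) can be derived as:

$$
\max_{\nu,\mu} \sum_{z \in \mathcal{V}} Q_Y(z)\nu(z) + \sum_{\hat z \in \mathcal{V}} Q_{\hat Y}(\hat z)\mu(\hat z): \quad \nu(z) + \mu(\hat z) \le \|z - \hat z\| \quad \forall z, \hat z \in \mathcal{V}. \tag{3.245}
$$

(b) Show that the dual problem (3.245) is equivalent to the following optimization:

$$
\max_{\psi} \sum_{z \in \mathcal{V}} Q_Y(z)\psi(z) - \sum_{\hat z \in \mathcal{V}} Q_{\hat Y}(\hat z)\psi(\hat z): \quad |\psi(z) - \psi(\hat z)| \le \|z - \hat z\| \quad \forall z, \hat z \in \mathcal{V}. \tag{3.246}
$$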
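Prob 10.3 (2nd-order Wasserstein distance) Consider the 2nd-order Wasserstein distance, defined as:

$$W_2^2(Q_Y, Q_{\hat Y}) = \min_{Q_{Y,\hat Y}} \mathbb{E}\left[\|Y - \hat Y\|^2\right] \tag{3.247}$$

where $Q_Y$ and $Q_{\hat Y}$ indicate the empirical distributions of real and fake samples defined as in (3.244). Here $Q_{Y,\hat Y}$ should respect the marginal distributions $Q_Y$ and $Q_{\hat Y}$. Let $\mathcal{V} := \mathcal{Y} \cup \hat{\mathcal{Y}}$.

(a) Show that

$$W_2^2(Q_Y, Q_{\hat Y}) = \mathbb{E}\left[\|Y\|^2\right] + \mathbb{E}\left[\|\hat Y\|^2\right] + 2 \min_{Q_{Y,\hat Y}} \mathbb{E}\left[-\hat Y^T Y\right]. \tag{3.248}$$

(b) Let

$$T(Q_Y, Q_{\hat Y}) := \min_{Q_{Y,\hat Y}} \mathbb{E}\left[-\hat Y^T Y\right]. \tag{3.249}$$

Show that the dual problem of (3.249) can be derived as:

$$
d^* = \max_{\nu,\mu} \sum_{z \in \mathcal{V}} Q_Y(z)\nu(z) + \sum_{\hat z \in \mathcal{V}} Q_{\hat Y}(\hat z)\mu(\hat z): \quad \nu(z) + \mu(\hat z) \le -z^T \hat z \quad \forall z, \hat z \in \mathcal{V}. \tag{3.250}
$$

(c) Show that the dual problem (3.250) is equivalent to the following optimization:

$$d^{**} = \max_{\psi:\,\text{convex}} -\sum_{z \in \mathcal{V}} Q_Y(z)\psi(z) - \sum_{\hat z \in \mathcal{V}} Q_{\hat Y}(\hat z)\psi^*(\hat z) \tag{3.251}$$

where

$$\psi^*(\hat z) := \max_{z \in \mathcal{V}} \left\{-\psi(z) + \hat z^T z\right\} \quad \forall \hat z \in \mathcal{V}. \tag{3.252}$$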
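Prob 10.4 (Gaussian linear generator) Consider a Gaussian linear-generator setting in which the data $\{y^{(i)}\}_{i=1}^{m}$ follow a Gaussian distribution, say:

$$y^{(i)} \sim \mathcal{N}(0, K_Y) \quad \text{where } K_Y = U \Lambda U^T, \tag{3.253}$$

and the generator is subject to a linear operation:

$$\hat y^{(i)} := G(x^{(i)}) = G x^{(i)} \quad \text{where } G \in \mathbb{R}^{n \times k}. \tag{3.254}$$

Assume that $\text{rank}(K_Y) = n$; $k \le n$; and

$$x^{(i)} \sim \mathcal{N}(0, I). \tag{3.255}$$

Also assume that $x^{(i)}$'s are independent and identically distributed (i.i.d.); so are $y^{(i)}$'s.

(a) Show that the distribution of $\hat y^{(i)}$ is:

$$\hat y^{(i)} \sim \mathcal{N}(0, G G^T). \tag{3.256}$$

(b) Let $K := G G^T$. With the eigenvalue decomposition, let us express $K$ as:

$$K = V \Sigma V^T \tag{3.257}$$

where $\Sigma := \text{diag}(\sigma_1, \dots, \sigma_k, 0, \dots, 0)$ and $V \in \mathbb{R}^{n \times n}$ indicates a unitary matrix. Here $\sigma_i$'s are eigenvalues of $K$. Let $K^{\dagger} := V \Sigma^{-1} V^T$ be the pseudo-inverse of $K$ where $\Sigma^{-1}$ is defined as:

$$\Sigma^{-1} := \text{diag}(\sigma_1^{-1}, \dots, \sigma_k^{-1}, 0, \dots, 0). \tag{3.258}$$

Consider a density function defined on the projected space w.r.t. $G$:

$$f_{\hat Y}(\hat y) = \frac{1}{\sqrt{(2\pi)^k |K|}} \exp\left(-\frac{1}{2} \hat y^T K^{\dagger} \hat y\right) \tag{3.259}$$

where $\hat y \in \mathbb{R}^n$ and $|K| := \prod_{i=1}^{k} \sigma_i$. Express

$$\log \prod_{i=1}^{m} f_{\hat Y}(y^{(i)}) \tag{3.260}$$

in terms of a sample covariance matrix $S := \frac{1}{m}\sum_{i=1}^{m} y^{(i)} y^{(i)T}$.

(c) Derive the optimal $K^* = G^* G^{*T}$ such that the log-likelihood function (3.260) is maximized.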
Prob 10.5 (Uniform linear generator) Suppose $Y = [Y_1, Y_2]^T$ is a two-dimensional random vector where $Y_i$'s are independent and identically distributed (i.i.d.) $\sim \text{Unif}[0, 1]$, i.e., the probability density function $f_{Y_i}(y) = 1$ for $y \in [0, 1]$ and $i \in \{1, 2\}$. Let $X \sim \text{Unif}[0, 1]$, being independent of $Y$. Let $\hat Y = [g_1 X, g_2 X]^T$ where $g_1, g_2 \in \mathbb{R}$. Let $f_{\hat Y}(\cdot)$ be the probability density function w.r.t. $\hat Y$. A student claims that $\text{JSD}(f_Y, f_{\hat Y})$ does not depend on how we choose $(g_1, g_2)$. Prove or disprove this claim.


Prob 10.6 (True or False?)

(a) Let $\{y^{(i)}\}_{i=1}^{m}$ and $\{\hat y^{(i)}\}_{i=1}^{m}$ be real and fake samples in generative modeling. Let $\mathcal{Y} := \{y^{(1)}, \dots, y^{(m)}\}$ and $\hat{\mathcal{Y}} := \{\hat y^{(1)}, \dots, \hat y^{(m)}\}$. Suppose $|\mathcal{Y}| = |\hat{\mathcal{Y}}| = m$ and $\mathcal{Y} \cap \hat{\mathcal{Y}} = \emptyset$. Then, the Jensen-Shannon divergence between the empirical distributions of real and fake samples does not depend on the generator function $G(\cdot)$, i.e., $\text{JSD}(Q_Y, Q_{\hat Y})$ does not change over $G(\cdot)$.

(b) Consider a Gaussian linear-generator setting in which the data $\{y^{(i)}\}_{i=1}^{m}$ follow a Gaussian distribution, say $y^{(i)} \sim \mathcal{N}(0, K_Y)$, where $y^{(i)} \in \mathbb{R}^n$ and $K_Y$ has full rank; the generator is subject to a linear operation $\hat y^{(i)} = G x^{(i)}$ where $G \in \mathbb{R}^{n \times k}$ and $x^{(i)} \in \mathbb{R}^k$. Assume that $k \le n$ and $x^{(i)} \sim \mathcal{N}(0, I)$. Also assume that $x^{(i)}$'s are independent and identically distributed (i.i.d.); so are $y^{(i)}$'s. Consider a regime in which $m \to \infty$. Then, under this setting, Wasserstein GAN can yield the optimal $G$ that maximizes the likelihood (probability density function) of the data.

(c) Suppose $Y = [Y_1, Y_2]^T$ is a two-dimensional random vector where $Y_i$'s are i.i.d. $\sim \text{Unif}[0, 1]$, i.e., $f_{Y_i}(y) = 1$ for $y \in [0, 1]$. Let $X \sim \text{Unif}[0, 1]$, being independent of $Y$. Let $\hat Y = [g_1 X, g_2 X]^T$ where $g_i \in \mathbb{R}$. Consider $\text{KLD}(f_Y, f_{\hat Y})$ and $\text{JSD}(f_Y, f_{\hat Y})$. Both of them are invariant to the values of $(g_1, g_2)$.
3.12 Fair Machine Learning

Recap During the past sections, we have explored two prominent methodologies for machine learning: (i) supervised learning; and (ii) unsupervised learning. The goal of supervised learning is to estimate the function $f(\cdot)$ of a system of interest from input-output example pairs, as illustrated in Fig. 3.43(a).

In an effort to translate a function optimization problem (a natural formulation of supervised learning) into a parameter-based optimization problem, we parameterized the function with a deep neural network. We also investigated some common practices adopted by many practitioners: employing ReLU (or leaky ReLU) activation at hidden layers and softmax at the output layer; using cross entropy loss (the optimal loss function in the sense of maximum likelihood); and applying advanced versions of gradient descent, namely the Adam and RMSprop optimizers.

We also learned about unsupervised learning. We put a special emphasis on one famous unsupervised learning method: generative modeling, wherein the goal is to generate fake examples so that their distribution is as close as possible to that of real examples; see Fig. 3.43(b). In particular, we focused on one powerful generative model based on Generative Adversarial Networks (GANs), which have played a revolutionary role in the modern AI field. We explored its interesting connection to a well-known divergence measure in statistics: Jensen-Shannon divergence. We also studied a variant of GAN, named Wasserstein GAN, which addresses a critical issue that arises in Goodfellow's GAN under very expressive DNN architectures. Moreover, we learned how to do TensorFlow implementation both for GAN and WGAN.


Next application As the final application, we will explore one recent trending 
topic that arises in the modern machine learning: Fair machine learning. There are 
three reasons that we emphasize this topic. 

The first reason is motivated by the recent trend in the machine learning field. As 
machine learning becomes prevalent in our daily lives involving a widening array of 


Figure 3.43. (a) Supervised learning: Learning the function $f(\cdot)$ of an interested system
from input-output example pairs $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$; (b) Generative modeling (an unsupervised
learning methodology): Generating fake data that resemble real data, reflected in $\{x^{(i)}\}_{i=1}^{m}$.

Figure 3.44. Machine learning-based recidivism score predictor of the US Supreme
Court: Black defendants were 77.3 percent more likely than white defendants to receive
high recidivism scores.


applications such as medicine, finance, job hiring and criminal justice, one morally
and legally motivated need in the design of machine learning algorithms is to ensure
fairness for disadvantaged groups relative to advantaged ones. The fairness issue has
received particular attention because of a learning algorithm used by the US Supreme
Court that yields unbalanced recidivism (criminal reoffending) scores across distinct
races, e.g., it predicts higher scores for blacks than for whites (Larson et al.); see
Fig. 3.44. Hence, we wish to touch upon this trending and important topic in this
book. The second reason concerns an interesting connection to two topics that we
learned in Parts I and II, respectively: (i) the regularization technique; and (ii) the
optimality condition of convex optimization, characterized by the KKT condition.
It turns out that these two play a key role in formulating an optimization problem
for fair machine learning algorithms. The last reason is that the optimization
problem is closely related to the GAN optimization that we learned in the past sections.
You may see a coherent sequence of applications: from supervised
learning, to GANs, to fair machine learning.


During upcoming sections For a couple of upcoming sections, we will investigate
fair machine learning in depth. What we are going to cover is fourfold.
First off, we will figure out what fair machine learning is. We will then study two
prominent fairness concepts that have been established in the recent literature. We
will also formulate an optimization for fair machine learning algorithms which
respect the fairness constraints based on the concepts. Next we will demonstrate
that the regularization technique forms the basis of such an optimization and that the
optimization can be rewritten as the GAN optimization that we learned in the prior
application. Lastly we will learn how to solve the optimization and implement it via
TensorFlow. In this section, we will cover the first two.


Fair machine learning Fair machine learning is a subfield of machine learning
that focuses on the theme of fairness. In view of the definition of machine learning,
fair machine learning can concretely be defined as a field of algorithms that train a
machine so that it can perform a specific task in a fair manner.

Like traditional machine learning, there are two methodologies for fair machine
learning. One is fair supervised learning, wherein the goal is to develop a fair classifier
using input-output sample pairs: $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$. The second is the unsupervised
learning counterpart. In particular, what is a proper counterpart for generative
modeling? A natural one is fair generative modeling, in which the goal is to generate
fairness-ensured fake data which are also realistic. For instance, we may want to
generate class-balanced samples even when trained with real data whose group sizes are
biased across different demographics. In this book, we will focus only on fair supervised
learning.


Two major fairness concepts In order to develop a fair classifier, we first need
to understand what fairness means. Fairness is a term that arises in law,
which deals with justice. So it has a long and rich history, and there are numerous
concepts prevalent in the field of law. We focus only on two major and prominent
concepts of fairness, which have received particular attention in the modern
machine learning field.

The first is disparate treatment (DT). This means an unequal treatment that 
occurs directly because of some sensitive information (such as race, sex, and reli- 
gion), often called sensitive attributes in the literature. It is also called direct discrim- 
ination, since such attributes directly serve to incur discrimination. 

The second is disparate impact (DI). This means an action that adversely 
affects one group against another even with formally neutral rules wherein sensitive 
attributes are never used in classification and therefore the DT does not occur. It is 
also called indirect discrimination, since a disparate action is made indirectly through 
biased historical data. 


Criminal reoffending predictor How can we design a fair classifier that respects
the above two fairness concepts, DT and DI? For simplicity, let us explore this
in the context of a simple yet concrete classification setting: Criminal reoffending
prediction, wherein the task is to predict whether or not an individual of interest
with criminal records would reoffend in the near future, say within two years. This
is indeed the classification being done by the US Supreme Court for the purpose
of deciding parole.


A simple setting For illustrative purposes, we consider a simplified version of
the predictor wherein only a few pieces of information are employed for prediction. See
Fig. 3.45.

There are two types of data employed: (i) objective data; (ii) sensitive data (also
called sensitive attributes). For objective data that we denote by $x$, we consider only
two features, say $x_1$ and $x_2$. Let $x_1$ be the number of prior criminal records. Let $x_2$
be a criminal type, e.g., misdemeanor or felony. For sensitive data, we employ a
different notation, say $z$. We consider a simple scalar and binary case in which $z$
indicates a race type only among white ($z = 0$) and black ($z = 1$). Let $\hat{y}$ be the
classifier output which aims to represent the ground-truth conditional distribution
$P(y|x, z)$. Here $y$ denotes the ground-truth label: $y = 1$ means reoffending within
2 years; $y = 0$ otherwise. This is a supervised learning setup, so we are given $m$
example triplets: $\{(x^{(i)}, z^{(i)}, y^{(i)})\}_{i=1}^{m}$.

Figure 3.45. A criminal reoffending predictor.


How to avoid disparate treatment? First of all, how do we deal with disparate
treatment? Recall the DT concept: An unequal treatment directly because of sensitive
attributes. Hence, in order to avoid the DT, we should ensure that the prediction
is not a function of the sensitive attribute. A mathematically precise
expression for this is:


$$P(y|x, z) = P(y|x) \quad \forall z. \tag{3.261}$$


How to ensure the above? The solution is very simple: do not use the sensitive
attribute $z$ at all in the prediction, as illustrated with a red-colored "x" mark in
Fig. 3.45. Here an important thing to notice is that the sensitive attribute is offered
as part of training data although it is not used as part of the input. So the $z^{(i)}$'s can be
employed in the design of an algorithm.


What about disparate impact? What about the other fairness concept,
disparate impact? How can we avoid DI? Again recall the DI concept: An action that
adversely affects one group against another even with formally neutral rules. Actually
it is not that clear how to implement this mathematically.

To gain some insights, let us investigate the precise mathematical definition of
DI. To this end, let us introduce some notation. Let $Z$ be a random variable that
indicates a sensitive attribute. For instance, consider a binary case, say $Z \in \{0, 1\}$.
Let $\tilde{Y}$ be a binary hard-decision value of the predictor output $\hat{Y}$ at the middle
threshold: $\tilde{Y} := \mathbf{1}\{\hat{Y} > 0.5\}$. Observe a ratio of likelihoods of positive example
events $\tilde{Y} = 1$ for two cases: $Z = 0$ and $Z = 1$.


$$\frac{P(\tilde{Y} = 1 | Z = 0)}{P(\tilde{Y} = 1 | Z = 1)}. \tag{3.262}$$


One natural interpretation is that a classifier is more fair when the ratio is closer
to 1, and becomes unfair if the ratio is far away from 1. The DI is quantified based on
this, so it is defined as (Zafar et al., 2017):


$$\mathrm{DI} := \min\left( \frac{P(\tilde{Y} = 1 | Z = 0)}{P(\tilde{Y} = 1 | Z = 1)}, \; \frac{P(\tilde{Y} = 1 | Z = 1)}{P(\tilde{Y} = 1 | Z = 0)} \right). \tag{3.263}$$


Notice that $0 \le \mathrm{DI} \le 1$ and the larger DI, the more fair the situation is.
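
As a quick numerical illustration of (3.263), consider hypothetical values (chosen here only for illustration, not taken from any dataset): $P(\tilde{Y} = 1 | Z = 0) = 0.3$ and $P(\tilde{Y} = 1 | Z = 1) = 0.6$. Then:

```latex
\mathrm{DI} = \min\left( \frac{0.3}{0.6}, \; \frac{0.6}{0.3} \right) = \min(0.5, \; 2) = 0.5 .
```

If instead both probabilities were equal, say 0.4 each, both ratios would be 1 and DI = 1, the fairest value under this metric.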


Two cases In view of the mathematical definition (3.263), reducing disparate
impact means maximizing the mathematical quantity (3.263). How, then, can we design
a classifier so as to maximize DI? Depending on the situation, the design methodology
can be different. To see this, think about two extreme cases.

The first is the one in which training data is already fair:


$$\{(x^{(i)}, z^{(i)}, y^{(i)})\}_{i=1}^{m} \;\rightarrow\; \text{large DI}.$$


In this case, a natural solution is to simply rely on a conventional classifier that aims
to maximize prediction accuracy. Why? Because maximizing prediction accuracy
would well respect training data, which in turn yields large DI. The second is a
non-trivial case in which training data is far from being fair:


$$\{(x^{(i)}, z^{(i)}, y^{(i)})\}_{i=1}^{m} \;\rightarrow\; \text{small DI}.$$


In this case, the conventional classifier would yield a small value of DI. This is indeed
a challenging scenario where we need to take some non-trivial action for ensuring
fairness.

In fact, the second scenario can often occur in reality, since there could be biased
historical records which form the basis of training data. For instance, the Supreme
Court can make some biased decisions against blacks relative to whites, and these are likely
to be employed as training data. See Fig. 3.46 for one such unfair scenario. In
Fig. 3.46, a hollowed (or black-colored-solid) circle indicates a data point of an
individual with white (or black) race; and the red (or blue) colored edge (ring)
denotes the event that the interested individual reoffends (or non-reoffends) within
two years. This is an unfair situation. Notice that for positive examples $y = 1$, there
are more black-colored-solid circles than hollowed ones, reflecting biased
historical records that favor whites over blacks. Similarly for negative examples
$y = 0$, there are more hollowed circles relative to solid ones.

Figure 3.46. Visualization of a historically biased dataset: A hollowed (or black-colored-solid)
circle indicates a data point of an individual with white (or black) race; the red (or
blue) colored edge denotes $y = 1$ reoffending (or $y = 0$ non-reoffending) label.


How to ensure large DI? How can we ensure large DI under all possible scenarios,
including the above challenging unfair scenario? To gain insights, first recall an
optimization problem that we formulated earlier in the design of a conventional
classifier:


$$\min_{w} \; \frac{1}{m} \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) \tag{3.264}$$


where $\ell_{\mathrm{CE}}(\cdot, \cdot)$ indicates binary cross entropy loss, and $w$ denotes weights (parameters)
for a classifier. One natural approach to ensure large DI is to incorporate a DI-related
constraint in the optimization (3.264). Maximizing DI is equivalent to minimizing
$1 - \mathrm{DI}$ (since $0 \le \mathrm{DI} \le 1$). So we can resort to the regularization technique
that we learned in Part I. That is, adding the two objectives with different weights.


Regularized optimization Here is a regularized optimization:


$$\min_{w} \; \frac{1}{m} \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \cdot (1 - \mathrm{DI}) \tag{3.265}$$


where $\lambda$ denotes a regularization factor that balances prediction accuracy against
the DI-associated objective (minimizing $1 - \mathrm{DI}$). However, here an issue arises in
solving the regularized optimization (3.265). Recalling the definition of DI


$$\mathrm{DI} := \min\left( \frac{P(\tilde{Y} = 1 | Z = 0)}{P(\tilde{Y} = 1 | Z = 1)}, \; \frac{P(\tilde{Y} = 1 | Z = 1)}{P(\tilde{Y} = 1 | Z = 0)} \right),$$


we see that DI is a complicated function of $w$. We have no idea how to express
DI in terms of $w$.


Another way Since directly expressing DI as a function of $w$ is not doable, one
can take another route. The route that we will take is inspired by one
popular information-theoretic measure: mutual information (Cover, 1999). Notice
that DI = 1 means that the sensitive attribute $Z$ is independent of the hard decision
$\tilde{Y}$ of the prediction. One key property of mutual information is that mutual
information between two random variables being zero is the "sufficient and
necessary condition" for the independence between the two. This motivates
us to represent the constraint of DI = 1 as:


$$I(Z; \tilde{Y}) = 0. \tag{3.266}$$


This captures the complete independence between $Z$ and $\tilde{Y}$. Since the predictor
output is $\hat{Y}$ (instead of $\tilde{Y}$), we consider another stronger condition that concerns
$\hat{Y}$ directly:


$$I(Z; \hat{Y}) = 0. \tag{3.267}$$


Notice that the condition (3.267) is indeed stronger than (3.266), i.e., (3.267)
implies (3.266). This is because


$$I(Z; \tilde{Y}) \overset{(a)}{\le} I(Z; \tilde{Y}, \hat{Y}) \overset{(b)}{=} I(Z; \hat{Y}). \tag{3.268}$$


Here the step (a) is due to two key properties that mutual information has: (i) the
chain rule holds for mutual information, i.e., $I(Z; \tilde{Y}, \hat{Y}) = I(Z; \tilde{Y}) + I(Z; \hat{Y} | \tilde{Y})$;
and (ii) mutual information is non-negative like Kullback-Leibler divergence; in
this case, $I(Z; \hat{Y} | \tilde{Y}) \ge 0$. The step (b) is also because of the chain rule. To see this,
we employ the chain rule to have:


$$I(Z; \tilde{Y}, \hat{Y}) = I(Z; \hat{Y}) + I(Z; \tilde{Y} | \hat{Y}).$$

Here $I(Z; \tilde{Y} | \hat{Y}) = 0$ since $\tilde{Y}$ is a function of $\hat{Y}$: $\tilde{Y} := \mathbf{1}\{\hat{Y} > 0.5\}$. Notice
that (3.267) together with (3.268) gives (3.266).
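
The inequality chain (3.268) can be checked numerically on a small discrete example. The following is a minimal sketch in plain Python, not part of the original text: the joint pmf `joint_z_yhat` and the helper `mutual_information` are hypothetical choices made for illustration, with the score $\hat{Y}$ restricted to four values so that all sums are finite.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(A;B) in nats from a dict {(a, b): P(a, b)} (illustrative helper)."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# Hypothetical joint pmf of (Z, Yhat); Yhat takes four score values.
joint_z_yhat = {
    (0, 0.2): 0.25, (0, 0.4): 0.10, (0, 0.6): 0.10, (0, 0.8): 0.05,
    (1, 0.2): 0.05, (1, 0.4): 0.10, (1, 0.6): 0.10, (1, 0.8): 0.25,
}

# Push Yhat through the threshold to obtain the joint pmf of (Z, Ytilde).
joint_z_ytilde = defaultdict(float)
for (z, yhat), p in joint_z_yhat.items():
    joint_z_ytilde[(z, int(yhat > 0.5))] += p

i_soft = mutual_information(joint_z_yhat)    # I(Z; Yhat)
i_hard = mutual_information(joint_z_ytilde)  # I(Z; Ytilde)
print(i_hard, i_soft, i_hard <= i_soft)
```

In this example thresholding discards part of the dependence on $Z$, so $I(Z; \tilde{Y})$ comes out strictly smaller than $I(Z; \hat{Y})$, in line with (3.268).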


Strongly regularized optimization In summary, the condition (3.267) indeed
enforces the DI = 1 constraint. This then motivates us to consider the following
optimization:


$$\min_{w} \; \frac{1}{m} \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \cdot I(Z; \hat{Y}). \tag{3.269}$$


Now the question of interest is: How to express $I(Z; \hat{Y})$ in terms of classifier parameters
$w$? It turns out, interestingly, that there is a way to express it. Also it is intimately
related to the GAN optimization that we learned.


Look ahead In the next section, we will employ this approach to explicitly formulate
an optimization for a fair classifier, and then make an interesting connection with
the GAN optimization.


3.13 A Fair Classifier and Its Connection to GANs


Recap In the previous section, we introduced the last machine learning applica- 
tion: A fair classifier. As an example of a fair classifier, we considered a recidivism 
predictor wherein the task is to predict if an interested individual with prior crim- 
inal records would reoffend within two years; see Fig. 3.47 for illustration. 

In order to avoid disparate treatment (one famous fairness notion), we made 
the sensitive attribute not included as part of the input. To address another fair- 
ness notion (disparate impact, DI for short), we introduced a regularized term into 
the conventional optimization (taking only into account prediction accuracy), thus 
arriving at: 


$$\min_{w} \; \frac{1}{m} \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \cdot I(Z; \hat{Y}) \tag{3.270}$$


where $\lambda \ge 0$ is a regularization factor that balances prediction accuracy (reflected
in cross entropy loss) against the quantified fairness constraint, reflected in $I(Z; \hat{Y})$.
Remember that $I(Z; \hat{Y}) = 0$ is a sufficient condition for ensuring DI = 1. At the
end of the last section, we then claimed that one can express the mutual information
$I(Z; \hat{Y})$ in terms of an optimization parameter $w$, thereby enabling us to train the
model parameterized by $w$. We also mentioned that the resulting optimization
has an intimate connection to GANs.


Outline In this section, we will support the claim. Specifically we are going
to cover the following four things. First we will explore an interesting connection
between mutual information and a well-known divergence measure that we
introduced in Prob 8.4: the Kullback-Leibler (KL) divergence. Building upon the
connection and applying the optimality condition of convex optimization (fully
characterized by the KKT condition), we will show that $I(Z; \hat{Y})$ can be expressed
in terms of a model parameter $w$. Next, we will translate the resulting optimization
into an implementable form, thereby coming up with a concrete way to solve
the optimization. Lastly we will make an analogy with GANs.

Figure 3.47. A simple recidivism predictor: Predicting a recidivism score $\hat{y}$ from $x = (x_1, x_2)$.
Here $x_1$ indicates the number of prior criminal records; $x_2$ denotes a criminal type
(misdemeanor or felony); and $z$ is a race type among white ($z = 0$) and black ($z = 1$).


Connection between mutual information and KL divergence There are
two versions of the definition of mutual information. The first is based on the Shannon
entropy that we mentioned in Section 3.2. The second is expressed in terms of the
KL divergence. Here we adopt the second version to explore a connection between
mutual information and the KL divergence (Cover, 1999):


$$I(Z; \hat{Y}) := \mathrm{KL}\left( P_{\hat{Y}, Z} \,\|\, P_{\hat{Y}} P_{Z} \right). \tag{3.271}$$


If you think about it, this definition makes intuitive sense. Notice that the
independence between $Z$ and $\hat{Y}$ implies $P_{\hat{Y}, Z} = P_{\hat{Y}} P_{Z}$, which in turn leads to
$\mathrm{KL}(P_{\hat{Y}, Z} \,\|\, P_{\hat{Y}} P_{Z}) = 0$, thereby $I(Z; \hat{Y}) = 0$.


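Manipulation of (3.271) Starting with (3.271), we can express the mutual information
as:


$$
\begin{aligned}
I(Z; \hat{Y}) &= \mathrm{KL}\left( P_{\hat{Y}, Z} \,\|\, P_{\hat{Y}} P_{Z} \right) \\
&\overset{(a)}{=} \sum_{\hat{y} \in \hat{\mathcal{Y}},\, z \in \mathcal{Z}} P_{\hat{Y}, Z}(\hat{y}, z) \log \frac{P_{\hat{Y}, Z}(\hat{y}, z)}{P_{\hat{Y}}(\hat{y}) P_{Z}(z)} \\
&\overset{(b)}{=} \sum_{\hat{y} \in \hat{\mathcal{Y}},\, z \in \mathcal{Z}} P_{\hat{Y}, Z}(\hat{y}, z) \log \frac{P_{\hat{Y}, Z}(\hat{y}, z)}{P_{\hat{Y}}(\hat{y})} + \sum_{z \in \mathcal{Z}} P_{Z}(z) \log \frac{1}{P_{Z}(z)} \\
&\overset{(c)}{=} \sum_{\hat{y} \in \hat{\mathcal{Y}},\, z \in \mathcal{Z}} P_{\hat{Y}, Z}(\hat{y}, z) \log \frac{P_{\hat{Y}, Z}(\hat{y}, z)}{P_{\hat{Y}}(\hat{y})} + H(Z)
\end{aligned}
\tag{3.272}
$$


where (a) is due to the definition of the KL divergence; (b) comes from the total
probability law; and (c) is due to the definition of the Shannon entropy.
Now define the ratio inside the logarithm in the last line as:


$$D^{*}(\hat{y}, z) := \frac{P_{\hat{Y}, Z}(\hat{y}, z)}{P_{\hat{Y}}(\hat{y})}. \tag{3.273}$$

Due to the total probability law, $D^{*}(\hat{y}, z)$ should respect the sum-up-to-one
constraint w.r.t. $z$:


$$\sum_{z} D^{*}(\hat{y}, z) = 1 \quad \forall \hat{y}. \tag{3.274}$$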


Mutual information via function optimization Instead of $D^{*}(\hat{y}, z)$, one can
think about another function, say $D(\hat{y}, z)$, which respects only the sum-up-to-one
constraint (3.274). It turns out $D^{*}(\hat{y}, z)$ is the optimal choice among such $D(\hat{y}, z)$
in the sense of maximizing:


$$\sum_{\hat{y}, z} P_{\hat{Y}, Z}(\hat{y}, z) \log D(\hat{y}, z), \tag{3.275}$$


and this gives insights into expressing $I(Z; \hat{Y})$ in terms of $w$. To see this clearly,
let us formally state that $D^{*}(\hat{y}, z)$ is indeed the optimal choice via the following
theorem.

Theorem: The mutual information $I(Z; \hat{Y})$, reflected in the last line of (3.272),
can be represented as the following function optimization:


$$I(Z; \hat{Y}) = H(Z) + \max_{D(\hat{y}, z):\, \sum_{z} D(\hat{y}, z) = 1} \; \sum_{\hat{y}, z} P_{\hat{Y}, Z}(\hat{y}, z) \log D(\hat{y}, z). \tag{3.276}$$


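Proof: The proof relies upon what we learned in Part II: the optimality condition
for convex optimization. Notice that the optimization (3.276) is convex in
$D(\cdot, \cdot)$, since the log function is concave and concavity is preserved under addition
(so maximizing the objective is a convex problem). Hence, by checking the KKT condition
(the optimality condition for convex optimization), one can prove that the optimal
$D(\cdot, \cdot)$ indeed respects (3.273) and (3.274). See below for details. Consider the
Lagrange function:


$$\mathcal{L}(D(\hat{y}, z), \nu(\hat{y})) = \sum_{\hat{y}, z} P_{\hat{Y}, Z}(\hat{y}, z) \log D(\hat{y}, z) + \sum_{\hat{y}} \nu(\hat{y}) \left( 1 - \sum_{z} D(\hat{y}, z) \right) \tag{3.277}$$


where $\nu(\hat{y})$'s indicate Lagrange multipliers w.r.t. the equality constraints. Consider
the KKT condition:


$$\left. \frac{d \mathcal{L}(D(\hat{y}, z), \nu(\hat{y}))}{d D(\hat{y}, z)} \right|_{D = D_{\mathrm{opt}},\, \nu = \nu_{\mathrm{opt}}} = \frac{P_{\hat{Y}, Z}(\hat{y}, z)}{D_{\mathrm{opt}}(\hat{y}, z)} - \nu_{\mathrm{opt}}(\hat{y}) = 0; \tag{3.278}$$

$$\sum_{z} D_{\mathrm{opt}}(\hat{y}, z) = 1. \tag{3.279}$$


So we get $D_{\mathrm{opt}}(\hat{y}, z) = \frac{P_{\hat{Y}, Z}(\hat{y}, z)}{\nu_{\mathrm{opt}}(\hat{y})}$. Plugging this into (3.279), we obtain:


$$\sum_{z} D_{\mathrm{opt}}(\hat{y}, z) = \frac{\sum_{z} P_{\hat{Y}, Z}(\hat{y}, z)}{\nu_{\mathrm{opt}}(\hat{y})} = 1, \tag{3.280}$$


which yields:


$$\nu_{\mathrm{opt}}(\hat{y}) = \sum_{z} P_{\hat{Y}, Z}(\hat{y}, z) = P_{\hat{Y}}(\hat{y}). \tag{3.281}$$


This together with (3.278) then gives:


$$D_{\mathrm{opt}}(\hat{y}, z) = \frac{P_{\hat{Y}, Z}(\hat{y}, z)}{\nu_{\mathrm{opt}}(\hat{y})} = \frac{P_{\hat{Y}, Z}(\hat{y}, z)}{P_{\hat{Y}}(\hat{y})} = D^{*}(\hat{y}, z). \tag{3.282}$$


This completes the proof of the theorem.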
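
The theorem can be sanity-checked numerically on a toy alphabet. The sketch below is not from the original text: the joint pmf `P`, the labels `"low"`/`"high"`, and the random search are all hypothetical choices for illustration. It evaluates the right-hand side of (3.276) at the closed-form maximizer (3.273), compares it with the KL form (3.271), and confirms that randomly drawn feasible $D$ do not exceed it.

```python
import math
import random

# Hypothetical joint pmf P(yhat, z) on a small discrete alphabet.
P = {("low", 0): 0.35, ("low", 1): 0.15, ("high", 0): 0.15, ("high", 1): 0.35}

# Marginals P(yhat) and P(z).
p_y, p_z = {}, {}
for (y, z), p in P.items():
    p_y[y] = p_y.get(y, 0.0) + p
    p_z[z] = p_z.get(z, 0.0) + p

def objective(D):
    """H(Z) + sum_{yhat,z} P(yhat,z) log D(yhat,z): right side of (3.276) for a fixed D."""
    h_z = -sum(p * math.log(p) for p in p_z.values())
    return h_z + sum(p * math.log(D[key]) for key, p in P.items())

# Closed-form maximizer (3.273): D*(yhat, z) = P(yhat, z) / P(yhat).
D_star = {(y, z): p / p_y[y] for (y, z), p in P.items()}

# Mutual information via the KL form (3.271).
mi = sum(p * math.log(p / (p_y[y] * p_z[z])) for (y, z), p in P.items())

# Random feasible D (summing to one over z for every yhat) should not beat D*.
rng = random.Random(0)
best_random = -math.inf
for _ in range(1000):
    D = {}
    for y in p_y:
        u = rng.uniform(0.01, 0.99)
        D[(y, 0)], D[(y, 1)] = u, 1.0 - u
    best_random = max(best_random, objective(D))

print(objective(D_star), mi, best_random)
```

The first two printed numbers should agree up to floating-point error, and the third should not exceed them; a random search is of course only a spot check, not a substitute for the KKT argument.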


How to express $I(Z; \hat{Y})$ in terms of $w$? Are we done with expressing $I(Z; \hat{Y})$ in
terms of $w$? No. This is because $P_{\hat{Y}, Z}(\hat{y}, z)$ that appears in (3.276) is not available.
To resolve this problem, we rely upon the empirical distribution instead:


$$\mathbb{Q}_{\hat{Y}, Z}(\hat{y}^{(i)}, z^{(i)}) = \frac{1}{m} \quad \forall i \in \{1, \ldots, m\}.$$


In practice, the empirical distribution is very likely to be uniform, since $\hat{y}^{(i)}$ is real-valued
and hence the pair $(\hat{y}^{(i)}, z^{(i)})$ is unique with high probability. By parametrizing
the function $D(\cdot, \cdot)$ with another parameter, say $\theta$, we can approximate $I(Z; \hat{Y})$ as:


$$I(Z; \hat{Y}) \approx H(Z) + \max_{\theta:\, \sum_{z} D_{\theta}(\hat{y}, z) = 1} \; \frac{1}{m} \sum_{i=1}^{m} \log D_{\theta}(\hat{y}^{(i)}, z^{(i)}). \tag{3.283}$$


From the above parameterization building upon the function optimization (3.276),
we can now approximately express $I(Z; \hat{Y})$ in terms of $w$ and $\theta$.
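
To make the sample-based expression (3.283) concrete, the following minimal sketch evaluates its right-hand side for one given discriminator, without the maximization over $\theta$. It is an illustration under stated assumptions rather than the book's procedure: the function name `mi_estimate` is hypothetical, `d_values[i]` is assumed to hold $D(\hat{y}^{(i)}, z^{(i)})$ for the observed pair, and $H(Z)$ is estimated from the empirical frequencies of $z$.

```python
import math

def mi_estimate(d_values, z_samples):
    """H(Z) + (1/m) * sum_i log D(yhat_i, z_i) for a fixed D (no maximization).

    d_values[i] is the probability that the discriminator assigns to the
    observed attribute z_i given the prediction yhat_i (assumed input format).
    """
    m = len(z_samples)
    counts = {}
    for z in z_samples:
        counts[z] = counts.get(z, 0) + 1
    # Empirical entropy of Z in nats.
    h_z = -sum((c / m) * math.log(c / m) for c in counts.values())
    return h_z + sum(math.log(d) for d in d_values) / m

# An uninformative discriminator (always 0.5) on balanced binary z gives 0.
print(mi_estimate([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1]))
# A discriminator that guesses z well gives a positive value.
print(mi_estimate([0.9, 0.9, 0.9, 0.9], [0, 1, 0, 1]))
```

Because no maximization is performed, the value for any particular discriminator is at most the maximized quantity on the right-hand side of (3.283).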


Implementable optimization (Cho et al., 2020) Notice in (3.283) that
$H(Z)$ is irrelevant to the introduced optimization variables $(w, \theta)$. Hence, the
mutual information (MI)-based optimization (3.270) that we started with can be
(approximately) translated into:


$$\min_{w} \; \max_{\theta:\, \sum_{z} D_{\theta}(\hat{y}, z) = 1} \; \frac{1}{m} \left\{ \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \sum_{i=1}^{m} \log D_{\theta}(\hat{y}^{(i)}, z^{(i)}) \right\}. \tag{3.284}$$


The objective function is a function of $(w, \theta)$ and hence it is implementable, for
instance, via neural networks. Since we have "min max", we can apply the
variant of gradient descent that we learned in Section 2.2. That is, alternating gradient
descent, in which given $w$, $\theta$ is updated via the inner optimization and then
given the updated $\theta$, $w$ is newly updated via the outer optimization, and this process
iterates until it converges.


The architecture of the fair classifier The architecture of the implementable
optimization (3.284) is illustrated in Fig. 3.48. On top of a classifier, we introduce
a new entity, called the discriminator, which corresponds to the inner optimization.
In the discriminator, we wish to find $\theta^{*}$ that maximizes $\frac{1}{m} \sum_{i=1}^{m} \log D_{\theta}(\hat{y}^{(i)}, z^{(i)})$. On
the other hand, the classifier wants to minimize the term. Hence, $D_{\theta}(\hat{y}, z)$ can be
viewed as the ability to figure out $z$ from prediction $\hat{y}$. Notice that the classifier
wishes to minimize the ability for the purpose of fairness, while the discriminator
has the opposite goal. So one natural interpretation that can be made on $D_{\theta}(\hat{y}, z)$ is
that it captures the probability that $z$ is indeed the ground-truth sensitive attribute
for $\hat{y}$. Here the softmax function is applied to ensure the sum-up-to-one constraint.


Analogy with GANs Since the classifier and the discriminator are competing,
one can make an analogy with Goodfellow's GAN, in which the generator and the
discriminator also compete like a two-player game. While the fair classifier and the
GAN bear strong similarity in their nature, these two are distinct in their roles. See
Fig. 3.49 for the detailed distinctions.


Look ahead We are now done with the optimization formulation for a fair clas- 
sifier. In the next section, we will study how to solve the optimization (3.284), as 
well as how to implement it via TensorFlow. 


Figure 3.48. The architecture of the MI-based fair classifier. The prediction output $\hat{y}$ is fed
into the discriminator wherein the goal is to figure out sensitive attribute $z$ from $\hat{y}$. The
discriminator output $D_{\theta}(\hat{y}, z)$ can be interpreted as the probability that $\hat{y}$ belongs to the
attribute $z$. Here the softmax function is applied to ensure the sum-up-to-one constraint.

MI-based fair classifier | GAN
Discriminator: figure out sensitive attribute from prediction | Discriminator: distinguish real samples from fake ones
Classifier: maximize prediction accuracy | Generator: generate realistic fake samples

Figure 3.49. MI-based fair classifier vs. GAN: Both bear similarity in structure (as illustrated
in Fig. 3.48), yet distinctions in role.


3.14 A Fair Classifier: TensorFlow Implementation


Recap Previously we formulated an optimization that respects two fairness constraints:
disparate treatment (DT) and disparate impact (DI). Given $m$ example
triplets $\{(x^{(i)}, z^{(i)}, y^{(i)})\}_{i=1}^{m}$:


$$\min_{w} \; \frac{1}{m} \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \cdot I(Z; \hat{Y})$$


where $\hat{y}^{(i)}$ indicates the classifier output, depending only on $x^{(i)}$ (not on the sensitive
attribute $z^{(i)}$ due to the DT constraint); and $\lambda$ is a regularization factor that
balances prediction accuracy against the DI constraint, quantified as $I(Z; \hat{Y})$. Using
the connection between mutual information and KL divergence, as well as the KKT
condition (the optimality condition for convex optimization), we could approximate
$I(Z; \hat{Y})$ in the form of optimization:


$$I(Z; \hat{Y}) \approx H(Z) + \max_{D(\hat{y}, z):\, \sum_{z} D(\hat{y}, z) = 1} \; \frac{1}{m} \sum_{i=1}^{m} \log D(\hat{y}^{(i)}, z^{(i)}). \tag{3.285}$$


We then parameterized $D(\cdot)$ with $\theta$ to obtain:


$$\min_{w} \; \max_{\theta:\, \sum_{z} D_{\theta}(\hat{y}, z) = 1} \; \frac{1}{m} \left\{ \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \sum_{i=1}^{m} \log D_{\theta}(\hat{y}^{(i)}, z^{(i)}) \right\}. \tag{3.286}$$


Two questions that arise are: (i) how to solve the optimization (3.286)?; and (ii)
how to implement it via TensorFlow?


Outline In this section, we will address these two questions. What we are going
to do is fourfold. First we will investigate a practical algorithm that allows
us to attack the optimization (3.286). We will then do a case study for the purpose
of exercising the algorithm. The case study is the one that we introduced
earlier: recidivism prediction. In the process, we will put a special emphasis on
one implementation detail: synthesizing an unfair dataset that we will use in our
experiments. Lastly we will learn how to implement the algorithm via TensorFlow.
For illustrative purposes, we will focus on a simple binary sensitive attribute
setting.




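Observation Let us start by translating the optimization (3.286) into the one
that is more programming-friendly:


$$
\begin{aligned}
&\min_{w} \; \max_{\theta:\, \sum_{z} D_{\theta}(\hat{y}, z) = 1} \; \frac{1}{m} \left\{ \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \sum_{i=1}^{m} \log D_{\theta}(\hat{y}^{(i)}, z^{(i)}) \right\} \\
&\overset{(a)}{=} \min_{w} \max_{\theta} \; \frac{1}{m} \left\{ \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \left( \sum_{i:\, z^{(i)} = 1} \log D_{\theta}(\hat{y}^{(i)}) + \sum_{i:\, z^{(i)} = 0} \log \left( 1 - D_{\theta}(\hat{y}^{(i)}) \right) \right) \right\} \\
&\overset{(b)}{=} \min_{w} \max_{\theta} \; \frac{1}{m} \left\{ \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) + \lambda \sum_{i=1}^{m} \left( z^{(i)} \log D_{\theta}(\hat{y}^{(i)}) + (1 - z^{(i)}) \log \left( 1 - D_{\theta}(\hat{y}^{(i)}) \right) \right) \right\} \\
&\overset{(c)}{=} \min_{w} \max_{\theta} \; \frac{1}{m} \left\{ \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, \hat{y}^{(i)}) - \lambda \sum_{i=1}^{m} \ell_{\mathrm{CE}}(z^{(i)}, D_{\theta}(\hat{y}^{(i)})) \right\} \\
&\overset{(d)}{=} \min_{w} \max_{\theta} \; \underbrace{\frac{1}{m} \left\{ \sum_{i=1}^{m} \ell_{\mathrm{CE}}(y^{(i)}, G_{w}(x^{(i)})) - \lambda \sum_{i=1}^{m} \ell_{\mathrm{CE}}(z^{(i)}, D_{\theta}(G_{w}(x^{(i)}))) \right\}}_{=: J(w, \theta)}
\end{aligned}
$$


where (a) is because we consider a binary sensitive attribute setting and we denote
$D_{\theta}(\hat{y}^{(i)}, 1)$ simply by $D_{\theta}(\hat{y}^{(i)})$; (b) is due to $z^{(i)} \in \{0, 1\}$; (c) follows from the
definition of binary cross entropy loss $\ell_{\mathrm{CE}}(\cdot, \cdot)$; and (d) comes from $G_{w}(x^{(i)}) := \hat{y}^{(i)}$.

Notice that $J(w, \theta)$ contains two cross entropy loss terms, each being a non-trivial
function of $G_{w}(\cdot)$ and/or $D_{\theta}(\cdot)$. Hence, in general, $J(w, \theta)$ is highly non-convex
in $w$ and non-concave in $\theta$.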


Alternating gradient descent Similar to the prior GAN setting in Section 3.8,
what we can do in this context is to apply the only technique that we are aware
of: alternating gradient descent. And then hope for the best. So we employ $k : 1$
alternating gradient descent:


1. Update classifier (generator)'s weight:

$$w^{(t+1)} \leftarrow w^{(t)} - \alpha_{1} \nabla_{w} J(w^{(t)}, \theta^{(t \cdot k)}).$$

2. Update discriminator's weight $k$ times while fixing $w^{(t+1)}$: for $i = 1 : k$,

$$\theta^{(t \cdot k + i)} \leftarrow \theta^{(t \cdot k + i - 1)} + \alpha_{2} \nabla_{\theta} J(w^{(t+1)}, \theta^{(t \cdot k + i - 1)}).$$

3. Repeat the above.


Similar to the GAN setting, one can use the Adam optimizer possibly together with
the batch version of the algorithm.

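Optimization used in our experiments Here is the optimization that we will
use in our experiments:


$$\min_{w} \max_{\theta} \; \frac{1}{m} \left\{ \sum_{i=1}^{m} (1 - \lambda)\, \ell_{\mathrm{CE}}(y^{(i)}, G_{w}(x^{(i)})) - \lambda\, \ell_{\mathrm{CE}}(z^{(i)}, D_{\theta}(G_{w}(x^{(i)}))) \right\}. \tag{3.287}$$


In order to restrict the range of $\lambda$ to $0 \le \lambda \le 1$, we apply the $(1 - \lambda)$ factor to
the loss term w.r.t. prediction accuracy.

Like the prior GAN setting, let us define two loss terms. One is "classifier (or
generator) loss":


$$\min_{w} \max_{\theta} \; \underbrace{\frac{1}{m} \sum_{i=1}^{m} \left\{ (1 - \lambda)\, \ell_{\mathrm{CE}}(y^{(i)}, G_{w}(x^{(i)})) - \lambda\, \ell_{\mathrm{CE}}(z^{(i)}, D_{\theta}(G_{w}(x^{(i)}))) \right\}}_{\text{“classifier (generator) loss”}}.$$


Given $w$, the discriminator wishes to maximize:

$$\max_{\theta} \; -\frac{\lambda}{m} \sum_{i=1}^{m} \ell_{\mathrm{CE}}(z^{(i)}, D_{\theta}(G_{w}(x^{(i)}))).$$


This is equivalent to minimizing the minus of the objective:


$$\min_{\theta} \; \underbrace{\frac{\lambda}{m} \sum_{i=1}^{m} \ell_{\mathrm{CE}}(z^{(i)}, D_{\theta}(G_{w}(x^{(i)})))}_{\text{“discriminator loss”}}. \tag{3.288}$$


This is how we define "discriminator loss".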
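
The two loss terms can be written down directly from (3.287) and (3.288). The following is a minimal sketch in plain Python rather than the TensorFlow code used later; the function names `bce`, `classifier_loss` and `discriminator_loss`, the clipping constant `eps`, and the sample numbers are assumptions introduced for illustration. Here `y_hat[i]` stands for $G_w(x^{(i)})$ and `d_out[i]` for $D_\theta(G_w(x^{(i)}))$.

```python
import math

def bce(target: float, prob: float, eps: float = 1e-12) -> float:
    """Binary cross entropy l_CE(target, prob), clipped for numerical safety."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))

def classifier_loss(y, z, y_hat, d_out, lam):
    """(1/m) * sum_i [(1 - lam) * l_CE(y_i, y_hat_i) - lam * l_CE(z_i, d_out_i)], cf. (3.287)."""
    m = len(y)
    return sum((1.0 - lam) * bce(y[i], y_hat[i]) - lam * bce(z[i], d_out[i])
               for i in range(m)) / m

def discriminator_loss(z, d_out, lam):
    """(lam/m) * sum_i l_CE(z_i, d_out_i), cf. (3.288)."""
    m = len(z)
    return lam * sum(bce(z[i], d_out[i]) for i in range(m)) / m

# Two hypothetical examples with an uninformative discriminator (output 0.5).
y, z = [1, 0], [1, 0]
y_hat, d_out = [0.9, 0.2], [0.5, 0.5]
print(classifier_loss(y, z, y_hat, d_out, lam=0.5))
print(discriminator_loss(z, d_out, lam=0.5))
```

Note that the classifier loss decreases as the discriminator's cross entropy grows, which is exactly the competition between the two players described above.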

Performance metrics Unlike the prior settings (supervised learning and
GAN), here we need to introduce another performance metric that captures the
degree of fairness. To this end, let us first define the hard-decision value of the
prediction output w.r.t. a test example:


$$\tilde{y}_{\mathrm{test}} := \mathbf{1}\{\hat{y}_{\mathrm{test}} > 0.5\}.$$

The test accuracy is then defined as:


$$\frac{1}{m_{\mathrm{test}}} \sum_{i=1}^{m_{\mathrm{test}}} \mathbf{1}\{y_{\mathrm{test}}^{(i)} = \tilde{y}_{\mathrm{test}}^{(i)}\},$$


where $m_{\mathrm{test}}$ denotes the number of test examples. This is an empirical version of
the ground truth $P(Y_{\mathrm{test}} = \tilde{Y}_{\mathrm{test}})$.

How to define a fairness-related performance metric? Recall the mathematical
definition of DI:


$$\mathrm{DI} := \min\left( \frac{P(\tilde{Y} = 1 | Z = 0)}{P(\tilde{Y} = 1 | Z = 1)}, \; \frac{P(\tilde{Y} = 1 | Z = 1)}{P(\tilde{Y} = 1 | Z = 0)} \right). \tag{3.289}$$


Here you may wonder how to compute the two probabilities of interest: $P(\tilde{Y} = 1 | Z = 0)$
and $P(\tilde{Y} = 1 | Z = 1)$. Using their empirical versions together with the Law of
Large Numbers (i.e., the empirical mean converges to the true mean as the number
of samples tends to infinity), we can estimate them. For instance,


$$P(\tilde{Y} = 1 | Z = 0) = \frac{P(\tilde{Y} = 1, Z = 0)}{P(Z = 0)} \approx \frac{\sum_{i=1}^{m_{\mathrm{test}}} \mathbf{1}\{\tilde{y}_{\mathrm{test}}^{(i)} = 1,\, z_{\mathrm{test}}^{(i)} = 0\}}{\sum_{i=1}^{m_{\mathrm{test}}} \mathbf{1}\{z_{\mathrm{test}}^{(i)} = 0\}}$$


where the first equality is due to the definition of conditional probability and the
second approximation comes from the Law of Large Numbers. The above approximation
becomes more and more accurate as $m_{\mathrm{test}}$ gets larger. Similarly we can
approximate the other probability of interest $P(\tilde{Y} = 1 | Z = 1)$. This way, we can
evaluate DI (3.289).
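
Both metrics are simple counting exercises on the test set. The sketch below is a plain-Python illustration with hypothetical function names (`hard_decisions`, `accuracy`, `empirical_di`) and made-up test values; it assumes a binary sensitive attribute, that both groups appear in the test set, and, as a convention chosen here, returns 0 when one group never receives a positive decision.

```python
def hard_decisions(y_hat, threshold=0.5):
    """Hard decision 1{y_hat > threshold} for each prediction score."""
    return [1 if s > threshold else 0 for s in y_hat]

def accuracy(y_true, y_tilde):
    """Fraction of test examples whose hard decision matches the label."""
    return sum(int(a == b) for a, b in zip(y_true, y_tilde)) / len(y_true)

def empirical_di(y_tilde, z):
    """Empirical DI: min of the two ratios of estimated P(Ytilde=1 | Z=z)."""
    def positive_rate(group):
        n_group = sum(1 for zi in z if zi == group)
        n_pos = sum(1 for yi, zi in zip(y_tilde, z) if yi == 1 and zi == group)
        return n_pos / n_group  # assumes the group is non-empty
    r0, r1 = positive_rate(0), positive_rate(1)
    if r0 == 0 or r1 == 0:
        return 0.0  # convention chosen for this sketch
    return min(r0 / r1, r1 / r0)

# Hypothetical test scores, attributes and labels.
y_hat_test = [0.9, 0.7, 0.2, 0.6, 0.4, 0.1, 0.8, 0.3]
z_test = [1, 1, 1, 1, 0, 0, 0, 0]
y_test = [1, 1, 0, 1, 0, 0, 1, 0]

y_tilde_test = hard_decisions(y_hat_test)
print(accuracy(y_test, y_tilde_test))
print(empirical_di(y_tilde_test, z_test))
```

In this made-up example three of the four $z = 1$ individuals but only one of the four $z = 0$ individuals receive a positive decision, so the classifier is perfectly accurate yet has a low DI.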


A case study Let us exercise what we have learned so far with a simple exam- 
ple. As a case study, we consider the same simple setting that we introduced earlier: 
recidivism prediction, wherein the task is to predict if an interested individual reof- 
fends within two years, as illustrated in Fig. 3.50. 


Synthesizing an unfair dataset One thing that we need to be careful about
in the context of fair machine learning concerns unfair datasets. For simplicity, we
will employ a synthetic dataset, not a real-world dataset. In fact, there is a real-world
dataset that concerns recidivism prediction, called COMPAS (Angwin et al.,
2020). But this contains many attributes, so it is a bit complicated. Hence, we will
take a particular yet simple method to synthesize a much simpler unfair dataset.

Recall the visualization of an unfair data scenario that we investigated in Section 3.12,
which will form the basis of our synthetic dataset (to be explained in the
sequel); see Fig. 3.51 for the visualization. In Fig. 3.51, a hollowed (or black-colored-solid)
circle indicates a data point of an individual with white (or black)
race; and the red (or blue) colored edge (ring) denotes the event that the interested
individual reoffends (or non-reoffends) within two years. This is indeed an unfair
scenario: for $y = 1$, there are more black-colored-solid circles than hollowed ones;
similarly for $y = 0$, there are more hollowed circles relative to solid ones.

Figure 3.50. Predicting a recidivism score $\hat{y}$ from $x = (x_1, x_2)$. Here $x_1$ indicates the number
of prior criminal records; $x_2$ denotes a criminal type: misdemeanor or felony; and $z$ is a
race type among white ($z = 0$) and black ($z = 1$).

Figure 3.51. Visualization of a historically biased dataset: A hollowed (or black-colored-solid)
circle indicates a data point of an individual with white (or black) race; the red (or
blue) colored edge denotes $y = 1$ reoffending (or $y = 0$ non-reoffending) label.

To generate such an unfair dataset, we employ a simple method. See Fig. 3.52
for illustration of the method. We first generate m labels $y^{(i)}$'s so that they are
independent and identically distributed (i.i.d.), each according to a uniform
distribution, i.e., $P(Y = 1) = P(Y = 0) = \frac{1}{2}$. We denote the uniform distribution
by $\mathrm{Bern}(\frac{1}{2})$, since the associated random variable is known as a Bernoulli random
variable. Here the number $\frac{1}{2}$ inside the parenthesis indicates the probability of a
Bernoulli random variable being 1. For indices of positive examples (y = 1),
we then generate i.i.d. $x^{(i)}$'s according to $\mathcal{N}((1, 1), 0.5^2 I)$; and i.i.d. $z^{(i)}$'s as per
Bern(0.8), meaning that 80% are blacks (z = 1) and 20% are whites (z = 0)
among the positive individuals. Notice that the generation of $x^{(i)}$'s is not quite
realistic. The first and second components in $x^{(i)}$ do not precisely capture the number
of priors and a criminal type. You can view this generation as a crude abstraction
of the realistic data.

i.i.d. $y^{(i)} \sim \mathrm{Bern}(0.5)$
If $y^{(i)} = 1$: $x^{(i)} \sim \mathcal{N}((1, 1), 0.5^2 I)$; $z^{(i)} = 1$ (black) w.p. 0.8, $z^{(i)} = 0$ (white) w.p. 0.2.
If $y^{(i)} = 0$: $x^{(i)} \sim \mathcal{N}((-1, -1), 0.5^2 I)$; $z^{(i)} = 1$ (black) w.p. 0.2, $z^{(i)} = 0$ (white) w.p. 0.8.

Figure 3.52. A simple way to synthesize an unfair dataset.

Figure 3.53. The architecture of the MI-based fair classifier.

On the other hand, for negative examples (y = 0), we
generate i.i.d. $(x^{(i)}, z^{(i)})$'s with different distributions: $x^{(i)} \sim \mathcal{N}((-1, -1), 0.5^2 I)$
and $z^{(i)} \sim \mathrm{Bern}(0.2)$, meaning that 20% are blacks (z = 1) and 80% are whites
(z = 0). This way, $z^{(i)} \sim \mathrm{Bern}(\frac{1}{2})$. This is because

$$
P(Z = 1) \overset{(a)}{=} P(Y = 1) P(Z = 1 | Y = 1) + P(Y = 0) P(Z = 1 | Y = 0)
\overset{(b)}{=} \frac{1}{2} \cdot \frac{4}{5} + \frac{1}{2} \cdot \frac{1}{5} = \frac{1}{2}
$$

where (a) follows from the total probability law and the definition of conditional
probability; and (b) is due to the rule of the data generation method employed.
Here Z and Y denote generic random variables for $z^{(i)}$ and $y^{(i)}$, respectively.
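The marginal of the sensitive attribute can also be checked numerically. The following is a minimal sketch, not the book's code: it assumes NumPy's `default_rng` generator, a fixed seed, and a hypothetical sample size `m`, and compares the total-probability value with an empirical frequency.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed: an assumption, for reproducibility
m = 100_000                      # hypothetical number of samples

# labels: i.i.d. Bern(0.5)
y = rng.binomial(1, 0.5, size=m)

# sensitive attribute: Bern(0.8) when y = 1, Bern(0.2) when y = 0
p_black = np.where(y == 1, 0.8, 0.2)
z = rng.binomial(1, p_black)

# total probability law: P(Z=1) = P(Y=1)P(Z=1|Y=1) + P(Y=0)P(Z=1|Y=0)
p_z1_exact = 0.5 * 0.8 + 0.5 * 0.2
p_z1_empirical = z.mean()

print(p_z1_exact, p_z1_empirical)
```

Although Z is balanced overall, it is strongly correlated with the label, which is what makes the dataset unfair.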
Figure 3.54. Models for (a) the classifier and (b) the discriminator.


Model architecture Fig. 3.53 illustrates the architecture of the MI-based fair
classifier. Since we focus on the binary sensitive attribute, we have a single output
$D_\theta(\hat{y})$ in the discriminator. For models of the classifier and discriminator, we
employ very simple single-layer neural networks with logistic activation in the
output layer; see Fig. 3.54.


TensorFlow: Synthesizing an unfair dataset Let us discuss how to imple- 
ment such details via TensorFlow. First consider the synthesis of an unfair dataset. 
To generate i.i.d. Bernoulli (binary) random variables for labels, we use: 


import numpy as np 
y_train = np.random.binomial(1,0.5,size=(train_size,))


where the first two arguments (1, 0.5) specify Bern(0.5); and the trailing comma
after train_size in size=(train_size,) indicates a single dimension. Remember that
we generate i.i.d. Gaussian random variables for the $x^{(i)}$'s. To this end, for
positive examples one can use:


x = np.random.normal(loc=(1,1),scale=0.5, size=(train_size,2)) 
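The call above covers only the positive examples. As a hedged sketch of how the mean could be switched per sample according to the label, the snippet below builds a per-row `loc` array. The names `rng`, `loc` and the value of `train_size` are illustrative assumptions, not taken from the book.

```python
import numpy as np

rng = np.random.default_rng(1)   # assumed seed
train_size = 1000                # hypothetical size

y_train = rng.binomial(1, 0.5, size=(train_size,))

# per-sample mean: (1, 1) when y = 1 and (-1, -1) when y = 0
loc = np.where(y_train[:, None] == 1, 1.0, -1.0) * np.ones((1, 2))

# scale is the standard deviation, so the covariance is 0.5^2 * I
x_train = rng.normal(loc=loc, scale=0.5)

print(x_train.shape)
```

Passing an array as `loc` lets NumPy broadcast the mean row by row, so no Python loop over samples is needed.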


TensorFlow: Optimizers for classifier & discriminator For the classifier, we use
the Adam optimizer with a learning rate of 0.005 and (β1, β2) = (0.9, 0.999).
For the discriminator, we use a much simpler optimizer, named Stochastic Gradient
Descent, SGD for short. SGD is the naive gradient descent yet with a batch
size of 1. We use SGD with a learning rate of 0.005.


from tensorflow.keras.optimizers import Adam 

from tensorflow.keras.optimizers import SGD 
adam=Adam(learning_rate=0.005,beta_1=0.9, beta_2=0.999) 
sgd=SGD(learning_rate=0.005) 
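To make the roles of the learning rate and (β1, β2) concrete, here is a small pure-Python sketch of the two update rules applied to a toy one-dimensional objective f(w) = (w - 3)^2. The objective, the step count, and the helper names `sgd_minimize` and `adam_minimize` are assumptions for illustration; the update follows the standard Adam recipe rather than the exact Keras internals.

```python
import math

def sgd_minimize(grad, w0, lr=0.005, steps=5000):
    """Plain gradient descent: w <- w - lr * grad(w)."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def adam_minimize(grad, w0, lr=0.005, beta_1=0.9, beta_2=0.999,
                  eps=1e-7, steps=5000):
    """Adam: gradient descent with bias-corrected moment estimates."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta_1 * m + (1 - beta_1) * g        # first-moment estimate
        v = beta_2 * v + (1 - beta_2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta_1 ** t)            # bias correction
        v_hat = v / (1 - beta_2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

def grad(w):
    # gradient of the toy objective f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w_sgd = sgd_minimize(grad, 0.0)
w_adam = adam_minimize(grad, 0.0)
print(w_sgd, w_adam)
```

Both iterates approach the minimizer w = 3; Adam normalizes each step by the running gradient magnitude, which is why its early steps have size close to the learning rate.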




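TensorFlow: Classifier (generator) loss Recall the optimization problem of
interest:

$$
\min_{w} \max_{\theta} \; \frac{1}{m} \left\{ \sum_{i=1}^{m} (1-\lambda)\, \ell_{\mathrm{CE}}\big(y^{(i)}, G_w(x^{(i)})\big) - \lambda \sum_{i=1}^{m} \ell_{\mathrm{CE}}\big(z^{(i)}, D_\theta(G_w(x^{(i)}))\big) \right\}
$$

To implement the classifier loss (the objective in the above), we use: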


from tensorflow.keras.losses import BinaryCrossentropy
CE_loss = BinaryCrossentropy(from_logits=False)
# Keras losses are called as loss(y_true, y_pred)
p_loss = CE_loss(y_train,y_pred)
f_loss = CE_loss(z_train,discriminator(y_pred))
c_loss = (1-lamb)*p_loss - lamb*f_loss


where y_pred indicates the classifier output; y_train denotes a label; and z_train is a
binary sensitive attribute.
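The same objective can be traced by hand with NumPy. The sketch below is an illustration under stated assumptions: `binary_cross_entropy` and `classifier_loss` are hypothetical helper names, the clipping constant is an arbitrary choice, and the toy arrays stand in for classifier and discriminator outputs.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y log p + (1 - y) log(1 - p)], with p clipped away from 0 and 1."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))

def classifier_loss(y, y_hat, z, d_out, lamb):
    """(1 - lambda) * prediction loss - lambda * discriminator loss."""
    p_loss = binary_cross_entropy(y, y_hat)
    f_loss = binary_cross_entropy(z, d_out)
    return (1.0 - lamb) * p_loss - lamb * f_loss

# toy values (assumed): labels, classifier outputs, sensitive attributes
y = np.array([1.0, 0.0, 1.0, 0.0])
y_hat = np.array([0.9, 0.2, 0.7, 0.4])
z = np.array([1.0, 0.0, 0.0, 1.0])
d_out = np.array([0.5, 0.5, 0.5, 0.5])   # a discriminator that cannot guess z

loss_value = classifier_loss(y, y_hat, z, d_out, lamb=0.1)
print(loss_value)
```

With a discriminator that outputs 0.5 everywhere, the fairness term equals log 2, the largest value the classifier can force on an optimally trained discriminator.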


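TensorFlow: Discriminator loss Recall the discriminator loss that we defined
in (3.288):

$$
\min_{\theta} \; \frac{\lambda}{m} \sum_{i=1}^{m} \ell_{\mathrm{CE}}\big(z^{(i)}, D_\theta(G_w(x^{(i)}))\big).
$$

To implement this, we use:


# Keras losses are called as loss(y_true, y_pred)
f_loss = CE_loss(z_train,discriminator(y_pred))
d_loss = lamb*f_loss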


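TensorFlow: Evaluation Recall the DI performance:

$$
\mathrm{DI} := \min\left( \frac{P(\tilde{Y} = 1 | Z = 0)}{P(\tilde{Y} = 1 | Z = 1)}, \; \frac{P(\tilde{Y} = 1 | Z = 1)}{P(\tilde{Y} = 1 | Z = 0)} \right).
$$

To evaluate the DI performance, we rely on the following approximation:

$$
P(\tilde{Y} = 1 | Z = 0) \approx \frac{\sum_{i} \mathbf{1}\{\tilde{y}^{(i)} = 1, z^{(i)} = 0\}}{\sum_{i} \mathbf{1}\{z^{(i)} = 0\}},
$$

where the sums run over the examples being evaluated.

Here is how to implement this in detail: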


import numpy as np
# y_pred: NumPy array of classifier outputs
y_tilde = (y_pred>0.5).astype(int).squeeze()
z0_ind = (z_train == 0.0)
z1_ind = (z_train == 1.0)
z0_sum = int(np.sum(z0_ind))
z1_sum = int(np.sum(z1_ind))
P_y1_z0 = float(np.sum((y_tilde==1)[z0_ind]))/z0_sum
P_y1_z1 = float(np.sum((y_tilde==1)[z1_ind]))/z1_sum
DI = min(P_y1_z0/P_y1_z1, P_y1_z1/P_y1_z0)
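For reuse, the evaluation can be wrapped in a single function. The following is a sketch under assumptions: the function name `disparate_impact`, the `threshold` parameter, and the convention used when one of the rates is zero are all illustrative choices that do not come from the book.

```python
import numpy as np

def disparate_impact(y_pred, z, threshold=0.5):
    """Empirical DI from soft predictions y_pred and a binary attribute z."""
    y_tilde = (np.asarray(y_pred).squeeze() > threshold).astype(int)
    z = np.asarray(z).squeeze()
    p_y1_z0 = np.mean(y_tilde[z == 0] == 1)   # estimate of P(Y~=1 | Z=0)
    p_y1_z1 = np.mean(y_tilde[z == 1] == 1)   # estimate of P(Y~=1 | Z=1)
    if p_y1_z0 == 0 or p_y1_z1 == 0:
        # assumed convention: both rates zero counts as fair, otherwise DI = 0
        return 1.0 if p_y1_z0 == p_y1_z1 else 0.0
    return float(min(p_y1_z0 / p_y1_z1, p_y1_z1 / p_y1_z0))

# toy example: positive rate 2/3 for z = 0 and 1/3 for z = 1
di = disparate_impact([0.9, 0.2, 0.7, 0.4, 0.8, 0.1], [0, 0, 0, 1, 1, 1])
print(di)
```

A value of 1 means both groups receive positive predictions at the same rate; values near 0 indicate a large disparity.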


Closing Let us conclude the book. In Part I, we investigated several instances of 
convex optimization problems, ranging from LP, Least Squares, QP, SOCP, and all 
the way up to SDP. We studied how such problems are categorized, as well as how 
to formulate some real-world problems into such specialized problems via some 
translation techniques, possibly aided by matrix-vector notation. For certain
settings, including LP, unconstrained optimization and equality-constrained QP, we
also studied how to solve the problems explicitly.

In Part II, we studied two important theorems: (1) strong duality theorem; (2) 
weak duality theorem. With the strong duality theorem, we came up with a generic 
algorithm which provides detailed guidelines as to how to solve arbitrary convex 
optimization problems: the interior point method. With the weak duality theorem,
we investigated a specific yet powerful method, called Lagrange relaxation, which
can provide reasonably good approximate solutions for a variety of non-convex
problems.

In Part III, we explored one recent killer application where the optimization tools
that we learned play central roles: machine learning. In particular, we explored two
popular methodologies of machine learning: (1) supervised learning;
and (2) unsupervised learning. For supervised learning, we put an emphasis on
deep learning, which is based on deep neural network architectures that have received
significant attention recently. We found that the optimization tools and concepts
that we learned are instrumental, particularly in choosing objective functions as well
as gaining algorithmic insights. As for unsupervised learning, we investigated the
most fundamental learning method, called generative modeling, and then studied
one specific yet powerful framework for generative modeling, named GANs. In
this context, we observed that the duality theorems play a crucial role in enabling
a practical implementation of the state-of-the-art GAN, which is WGAN. As the
last application, we explored fair machine learning to demonstrate the power of the
regularization technique and the KKT condition.
There is no doubt that tools for convex optimization are very powerful. Their
usefulness has already been proved by many researchers working in a wide variety of
fields. While this book puts an emphasis on a particular application (machine
learning), the tools have much broader applicability. So we hope you will find all
of these useful in your own research field.
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Problem Set 11 


Prob 11.1 (Equalized Odds) In Section 3.12, we studied two fairness concepts:
(i) disparate treatment; and (ii) disparate impact. In this problem, we explore
another prominent fairness notion that arises in the field: Equalized Odds (EO for
short) (Hardt et al., 2016). Let $Z \in \mathcal{Z}$ be a sensitive attribute. Let $Y$ and $\hat{Y}$ be the
ground-truth label and its prediction.


(a) For illustrative purpose, let us investigate a simple setting where Z and Y
are binary. Let $\tilde{Y} = \mathbf{1}\{\hat{Y} > 0.5\}$. For this setting, EO is defined as:

$$
\mathrm{EO} := \min_{y \in \{0,1\}} \min_{z \in \{0,1\}} \frac{P(\tilde{Y} = 1 | Y = y, Z = 1 - z)}{P(\tilde{Y} = 1 | Y = y, Z = z)}. \tag{3.290}
$$

Show that $I(Z; \hat{Y} | Y) = 0$ implies EO = 1.

(b) Suppose now that Z and Y are not necessarily binary. The conditional
mutual information is defined as:

$$
I(Z; \hat{Y} | Y) := \mathrm{KL}\big(P_{\hat{Y},Z|Y} \,\|\, P_{\hat{Y}|Y} P_{Z|Y}\big)
$$

where $P_{\hat{Y},Z|Y}$, $P_{\hat{Y}|Y}$ and $P_{Z|Y}$ indicate the conditional probability of
$(\hat{Y}, Z)$, $\hat{Y}$, and $Z$, respectively, given $Y$. Using this definition, show that

$$
I(Z; \hat{Y} | Y) = \sum_{\hat{y} \in \mathcal{Y},\, y \in \mathcal{Y},\, z \in \mathcal{Z}} P_{\hat{Y},Z,Y}(\hat{y}, z, y) \log \frac{P_{\hat{Y},Z|y}(\hat{y}, z)}{P_{\hat{Y}|y}(\hat{y})} + H(Z|Y) \tag{3.291}
$$

where $P_{\hat{Y},Z,Y}$ indicates the joint distribution of $(\hat{Y}, Z, Y)$; and $P_{\hat{Y},Z|y}$ and
$P_{\hat{Y}|y}$ denote the conditional distributions of $(\hat{Y}, Z)$ and $\hat{Y}$, respectively,
given $Y = y$.

(c) Show that

$$
I(Z; \hat{Y} | Y) = H(Z|Y) + \max_{D(\hat{y},z,y):\, \sum_{z \in \mathcal{Z}} D(\hat{y},z,y) = 1} \; \sum_{\hat{y} \in \mathcal{Y},\, y \in \mathcal{Y},\, z \in \mathcal{Z}} P_{\hat{Y},Z,Y}(\hat{y}, z, y) \log D(\hat{y}, z, y). \tag{3.292}
$$

(d) Explain the rationale behind the following approximation:

$$
I(Z; \hat{Y} | Y) \approx H(Z|Y) + \max_{D(\hat{y},z,y):\, \sum_{z \in \mathcal{Z}} D(\hat{y},z,y) = 1} \; \frac{1}{m} \sum_{i=1}^{m} \log D\big(\hat{y}^{(i)}, z^{(i)}, y^{(i)}\big). \tag{3.293}
$$

(e) Formulate an optimization for a fair classifier that attempts to minimize
both the prediction loss and the approximated $I(Z; \hat{Y} | Y)$, reflected
in (3.293). Use a notation λ for a regularization factor that balances
prediction accuracy against the quantified fairness constraint. Also draw the
classifier-&-discriminator architecture which respects the formulated
optimization.
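As a numerical companion to the EO definition in (3.290), the sketch below estimates EO from samples in the binary setting. It is an illustration only: the function name `equalized_odds` is hypothetical, and the code assumes every (y, z) group is non-empty and has a non-zero positive rate, so that no ratio is undefined.

```python
import numpy as np

def equalized_odds(y_tilde, y, z):
    """Empirical EO for binary y_tilde (hard prediction), y (label), z (attribute)."""
    y_tilde, y, z = (np.asarray(a).astype(int) for a in (y_tilde, y, z))
    ratios = []
    for y_val in (0, 1):
        for z_val in (0, 1):
            # P(Y~=1 | Y=y, Z=1-z) divided by P(Y~=1 | Y=y, Z=z)
            num = np.mean(y_tilde[(y == y_val) & (z == 1 - z_val)] == 1)
            den = np.mean(y_tilde[(y == y_val) & (z == z_val)] == 1)
            ratios.append(num / den)
    return float(min(ratios))

# toy data (assumed): two samples for each (y, z) combination
y = [0, 0, 0, 0, 1, 1, 1, 1]
z = [0, 0, 1, 1, 0, 0, 1, 1]
eo_unfair = equalized_odds([0, 1, 1, 1, 1, 0, 1, 1], y, z)
eo_fair = equalized_odds([0, 1, 0, 1, 1, 0, 1, 0], y, z)
print(eo_unfair, eo_fair)
```

In the first toy prediction the z = 1 group is predicted positive twice as often within each label group, so EO = 0.5; in the second the rates match and EO = 1.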


Prob 11.2 (A variant of the MI-based fair classifier) Let $Z \in \{0, 1\}$ be a
binary sensitive attribute. Let $Y$ and $\hat{Y}$ be the ground-truth label and its prediction
by a classifier. Let $\tilde{Y} = \mathbf{1}\{\hat{Y} > 0.5\}$.


(a) Show that $I(Z; \tilde{Y}) = 0$ is a necessary and sufficient condition for DI = 1.

(b) Approximate $I(Z; \tilde{Y})$ as claimed in part (d) in Prob 11.1. Also explain the
rationale behind the approximation.

(c) Formulate an optimization for a fair classifier that attempts to minimize
both the prediction loss and the approximated $I(Z; \tilde{Y})$, done in the prior
part. Use a notation λ for a regularization factor that balances prediction
accuracy against the fairness constraint. Also draw the
classifier-&-discriminator architecture which respects the formulated optimization.


Prob 11.3 (TensorFlow implementation of the MI-based fair classifier)
Consider the MI-based fair classifier that we learned in Sections 3.13 and 3.14. In
this problem, you are asked to build a simple fair classifier that predicts recidivism
scores of individuals with prior criminal records. See Fig. 3.55. We employ very
simple single-layer neural networks for the classifier (generator) and the
discriminator, with logistic activation in the output layer.

Figure 3.55. Predicting a recidivism score ŷ from x = (x1, x2). Here x1 indicates the number
of prior criminal records; x2 denotes a criminal type: misdemeanor or felony; and z is a
race type among white (z = 0) and black (z = 1).


(a) (Unfair dataset synthesis) Explain how an unfair dataset is generated in the 
following code: 


import numpy as np
n_samples = 2000
p = 0.8
# numbers of positive and negative examples
n_Y1 = int(n_samples*0.5)
n_Y0 = n_samples - n_Y1
# generate positive samples
Y1 = np.ones(n_Y1)
X1 = np.random.normal(loc=[1,1],scale=0.5,
                      size=(n_Y1,2))
Z1 = np.random.binomial(1,p,size=(n_Y1,))
# generate negative samples
Y0 = np.zeros(n_Y0)
X0 = np.random.normal(loc=[-1,-1],scale=0.5,
                      size=(n_Y0,2))
Z0 = np.random.binomial(1,1-p,size=(n_Y0,))
# merge
Y = np.concatenate((Y1,Y0))
X = np.concatenate((X1,X0))
Z = np.concatenate((Z1,Z0))
Y = Y.astype(np.float32)
X = X.astype(np.float32)
Z = Z.astype(np.float32)
# shuffle and split into train & test data
shuffle = np.random.permutation(n_samples)
X_train = X[shuffle][:int(n_samples*0.8)]
Y_train = Y[shuffle][:int(n_samples*0.8)]
Z_train = Z[shuffle][:int(n_samples*0.8)]
X_test = X[shuffle][int(n_samples*0.8):]
Y_test = Y[shuffle][int(n_samples*0.8):]
Z_test = Z[shuffle][int(n_samples*0.8):]


(b) (Data visualization) Using the following code or otherwise, plot randomly 
sampled data points (say 200 random points) among the entire data points 
generated in part (a). 


import matplotlib.pyplot as plt
# randomly select the number n_s of samples
n_s = 200
Xs = X_train[:n_s]
Ys = Y_train[:n_s]
Zs = Z_train[:n_s]
# choose part of X and Y associated with a certain Z
X_Z0 = Xs[Zs==0.0]
X_Z1 = Xs[Zs==1.0]
Y_Z0 = Ys[Zs==0.0]
Y_Z1 = Ys[Zs==1.0]
# plot
plt.figure(figsize=(14,10))
plt.scatter(
    X_Z0[Y_Z0==1.0][:,0], X_Z0[Y_Z0==1.0][:,1],
    color='red',marker='o',facecolors='none',
    s=120, linewidth=1.5, label='White reoffend')
plt.scatter(
    X_Z0[Y_Z0==0.0][:,0], X_Z0[Y_Z0==0.0][:,1],
    color='blue',marker='o',facecolors='none',
    s=120, linewidth=1.5, label='White non-reoffend')
plt.scatter(
    X_Z1[Y_Z1==1.0][:,0], X_Z1[Y_Z1==1.0][:,1],
    color='red',marker='o',facecolors='black',
    s=120, linewidth=1.5, label='Black reoffend')
plt.scatter(
    X_Z1[Y_Z1==0.0][:,0], X_Z1[Y_Z1==0.0][:,1],
    color='blue',marker='o',facecolors='black',
    s=120, linewidth=1.5, label='Black non-reoffend')
plt.legend(fontsize=16)


(c) (Classifier & discriminator) Draw block diagrams for the classifier and the 
discriminator implemented by the following: 


from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense


classifier=Sequential()
classifier.add(Dense(1,input_dim=2,
                     activation='sigmoid'))
discriminator=Sequential()
discriminator.add(Dense(1,input_dim=1,
                        activation='sigmoid'))
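For intuition, a single dense layer with one sigmoid output computes σ(xw + b). The NumPy sketch below mirrors the two models above with hand-picked weights; the helper names and all numerical values are assumptions for illustration, not trained parameters.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def dense_sigmoid(x, w, b):
    """One dense layer with a single logistic output: sigmoid(x @ w + b)."""
    return sigmoid(x @ w + b)

# two inputs x = (x1, x2), one per row (toy values)
x = np.array([[1.0, 1.0], [-1.0, -1.0]])

# classifier: 2 inputs -> 1 output (hypothetical weights)
w_c, b_c = np.array([[2.0], [2.0]]), np.array([0.0])
y_hat = dense_sigmoid(x, w_c, b_c)        # shape (2, 1)

# discriminator: 1 input (the classifier output) -> 1 output
w_d, b_d = np.array([[1.5]]), np.array([-0.75])
d_out = dense_sigmoid(y_hat, w_d, b_d)    # shape (2, 1)

print(y_hat.ravel(), d_out.ravel())
```

The discriminator takes the classifier output as its only input, which is why its input dimension is 1.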


(d) (Optimizers and loss functions) Explain how the optimizers and loss func- 
tions for the discriminator and the classifier are implemented in the follow- 
ing code. Also draw a block diagram for the GAN model implemented as 
the name of gan. 

from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.layers import Concatenate


# optimizers for classifier & discriminator
c_opt=Adam(learning_rate=0.005,beta_1=0.9,beta_2=0.999)
d_opt=SGD(learning_rate=0.005)


# define discriminator loss
def d_loss(y_true,y_pred):
    CE_loss = BinaryCrossentropy(from_logits=False)
    lamb = 0.1
    # Keras losses are called as loss(y_true, y_pred)
    return lamb*CE_loss(y_true,y_pred)
# discriminator compile
discriminator.compile(loss=d_loss, optimizer=d_opt)


# define classifier (generator) loss
def c_loss(y_true,y_pred):
    # y_true[:,0]: Y_train (label)
    # y_true[:,1]: Z_train (sensitive attribute)
    # y_pred[:,0]: classifier output G(x)
    # y_pred[:,1]: discriminator output fed by
    #              classifier output D(G(x))
    CE_loss = BinaryCrossentropy(from_logits=False)
    lamb = 0.1
    p_loss = CE_loss(y_true[:,0],y_pred[:,0])
    f_loss = CE_loss(y_true[:,1],y_pred[:,1])
    return (1-lamb)*p_loss - lamb*f_loss


# define the GAN model
# input: x
# output: [G(x), D(G(x))]
discriminator.trainable = False
gan_input = Input(shape=(2,))
Gx = classifier(gan_input)
DGx = discriminator(Gx)
output = Concatenate()([Gx,DGx])
gan = Model(gan_input, output)
# The GAN model compile
gan.compile(loss=c_loss, optimizer=c_opt)
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(e) (Training) Explain how classifier and discriminator are trained in the fol- 
lowing code: 


import numpy as np


EPOCHS = 400
k=2 # k:1 alternating gradient descent
c_losses = []
d_losses = []


for epoch in range(1,EPOCHS+1):
    #####################
    # train discriminator
    #####################
    # input for discriminator
    d_input = classifier.predict(X_train)
    # label for discriminator
    d_label = Z_train
    # train discriminator
    d_loss = discriminator.train_on_batch(d_input,d_label)

    #####################
    # train classifier
    #####################
    if epoch % k == 0: # train once every k steps
        # label for classifier
        # 1st component: Y_train
        # 2nd component: Z_train (sensitive attribute)
        c_label = np.zeros((len(Y_train),2))
        c_label[:,0] = Y_train
        c_label[:,1] = Z_train
        # train classifier
        c_loss = gan.train_on_batch(X_train,c_label)

    c_losses.append(c_loss)
    d_losses.append(d_loss)


(f) (Evaluation) Suppose we train the classifier and discriminator using the code
in part (e) with EPOCHS=400. Plot the tradeoff performance between test
accuracy and DI by sweeping λ from 0 to 1. Also include the Python script.
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A.1 Jupyter Notebook 


Outline Python code needs a software platform on which it can be written and
run. One such platform is Jupyter notebook. In this section, we will learn some
basics of Jupyter notebook, in four steps.
First, we will figure out the role of Jupyter notebook in light of Python. We will
then investigate how to install the software, as well as how to launch a new file
for scripting code. Next, we will study some useful interfaces which enable
easy scripting of Python code. Lastly, we will introduce several shortcuts that are
frequently used in writing and executing code.


What is Jupyter notebook? Jupyter notebook is a powerful tool for writing
and running Python code. As you can see below, the key benefit of the tool is that
we are able to execute the code piece by piece (cell by cell) rather than all at
once. This makes debugging much easier, especially when the code is very
long.

a=1
b=2
a+b
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There are two typical ways to use Jupyter notebook. The first runs a code based 
on the server (or cloud) machine. The second is via a local machine. Here we will 
present the second way. 


Installation & launch The use of a local machine requires the installation of a
popular software tool, named Anaconda. You can download and install the latest
version of it via


https://www.anaconda.com/products/individual


You can choose one of the Anaconda installers depending on the operating
system of your machine. Please see the three versions presented in Fig. A.1. During
installation, you may encounter some errors. One error that often occurs concerns
a 'non-ASCII character'. To resolve the error, you should make sure that the
destination folder path for Anaconda does not include any non-ASCII character,
such as Korean characters. Another error message that you may see is about
permission for access to the indicated path. To avoid this, run the Anaconda
installer under the 'run as administrator' mode. The mode can be seen by
right-clicking the downloaded execution file.

In order to launch Jupyter notebook, you can use the Anaconda prompt (for
Windows) or a terminal (for macOS and Linux). The way it works is very simple. You
can just type jupyter notebook in the prompt and then press Enter. Then, a
Jupyter notebook window will pop up accordingly. If the window does not appear
automatically, you can instead copy and paste the URL (indicated by the arrow in
Fig. A.2) into your web browser, so as to manually open it. If it works properly, you
should be able to see a window like the one in Fig. A.3.

Creating a new notebook file is also simple. First navigate to the folder in which you
want to save the notebook file. Next, click the New tab placed on the top
right (marked in a blue box) and then click the Python 3 tab (as indicated in a red
box). See Fig. A.4 for the location of the tabs.

Figure A.1. Three versions of Anaconda installers.

Figure A.2. How to launch Jupyter notebook in the Anaconda prompt.

Figure A.3. Web browser of a successfully launched Jupyter notebook.

Figure A.4. How to create a Jupyter notebook file on the web browser.

Figure A.5. Kernel is a computational engine which runs the code. There are several
relevant functions under the Kernel tab.

Figure A.6. How to choose the Code or Markdown option in the edit mode.


Interface In Jupyter notebook, there are two key components required to run
code. The first is a computational engine which executes the code. The engine is
named Kernel and it can be controlled via several functions provided in the Kernel
tab. See Fig. A.5 for details.

The second component is an entity, called a cell, in which you can write a script.
The cell has two modes. The first is the so-called edit mode, which allows you
to type either: (i) a code script for running a program; or (ii) any text, as in a
normal text editor. The code script can be written under the Code option (marked in
a red box) as illustrated in Fig. A.6. Text editing can be done under the Markdown
option, marked in a blue box. The other mode is the command mode. Under this
mode, we can edit a notebook as a whole. For instance, we can copy or delete some
cells, and move around cells under the command mode.


Shortcuts There are many shortcuts that are quite instrumental in editing and
navigating a notebook. Here we emphasize three types of frequently used shortcuts.
The first is a set of shortcuts for switching between the edit and command
modes. We press Esc to change from the edit mode to the command mode. We use Enter
for the other way around. The second is for inserting or deleting a cell. The shortcut
a inserts a new cell above the current cell. Another shortcut, b, plays a similar
role, yet inserts the cell below the current cell. d+d (pressing d twice) deletes a
cell. Notice that these should be typed under the command mode to serve their
proper roles. The last is a set of shortcuts for navigating and executing cells.
Arrow keys are used to move between cells. Shift + Enter runs the current cell
(and moves to the next cell). In order to stay in the current cell (even after
execution), we use Ctrl + Enter instead.
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A.2 Basic Python Syntaxes 


Outline In this section, we will learn some basic Python syntaxes required to write 
a script for convex optimization problems of this book’s focus. In particular, three 
basic concepts are emphasized: (i) class; (ii) package; and (iii) function. We will 
also introduce a collection of optimization-related Python packages. 


A.2.1 Data Structure


There are two prominent data-structure components in Python: (i) list; and (ii) 
set. 


(i) List: List is a built-in data type which allows us to store multiple elements 
in a single variable. The elements are listed with an order and the list allows for 
duplication. Please see below for some popular use. 


x = [1, 2, 3, 4]  # construct a simple list
print(x)


[1, 2, 3, 4]


x.append(5)  # add an item
print(x)


[1, 2, 3, 4, 5]


x.pop()  # delete the last item
print(x)


[1, 2, 3, 4]


# checking if a particular element exists in the list
if 1 in x:
    print(True)
if 5 in x:
    print(True)
else:
    print(False)


True
False
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# A single-line construction of a list 
y = [x for x in range(1,10)] 
print(y) 


[1, 2, 3, 4, 5, 6, 7, 8, 9] 


# Retrieving all the elements through a "for" loop
for i in x:
    print(i)


1
2
3
4


(ii) Set: Set is another built-in data type which plays a similar role as List. Two 
key distinctions are: (i) it is unordered; and (ii) it does not allow for duplication. 
See below for some examples of how to use. 


x = set({1, 2, 3})  # construct a set
print(f"x: {x}, type of x: {type(x)}")


x: {1, 2, 3}, type of x: <class 'set'>


Here the f in front of the string in the print command tells Python to evaluate
the expressions inside {·}.


x.add(1)  # add an existing item
print(x)


{1, 2, 3}


x.add(4)  # add a new item
print(x)


{1, 2, 3, 4}


# checking if a particular element exists in the set
if 1 in x:
    print(True)
if 5 in x:
    print(True)
else:
    print(False)


True
False
# Retrieving all the elements through a "for" loop
for i in x:
    print(i)


1
2
3
4
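The two distinctions between List and Set can be seen side by side in a short sketch. The variable names below are illustrative assumptions; the last line uses `dict.fromkeys`, a common idiom that is not part of the text above, to remove duplicates while keeping the first-seen order.

```python
items = [3, 1, 2, 3, 1]

as_list = list(items)   # keeps order and duplicates
as_set = set(items)     # unordered, duplicates removed

# removing duplicates while preserving the first-seen order
unique_in_order = list(dict.fromkeys(items))

print(as_list)
print(sorted(as_set))
print(unique_in_order)
```

Because a set is unordered, its elements are sorted here before printing so that the display does not depend on the internal ordering.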


A.2.2 Package 


We will investigate five popular packages which are particularly instrumental in 
scripting a code for convex optimization problems: (i) math; (ii) random; (iii) 
itertools; (iv) numpy; and (v) scipy.stats. 


(i) math: This module provides a collection of useful math expressions such as 
exponent, log, square root and power. See some relevant examples below. 


import math
math.exp(1)  # exponential
2.718281828459045


print(math.log(1, 10))  # log(x, base)
print(math.log(math.exp(20)))  # base-e logarithm
print(math.log2(4))  # base-2 logarithm
print(math.log10(1000))  # base-10 logarithm


0.0
20.0
2.0
3.0


print(math.sqrt(16))  # square root
print(math.pow(2,4))  # x raised to y (same as x**y)
print(2**4)


4.0
16.0
16


print(math.cos(math.pi))  # cosine of x radians
print(math.dist([1,2],[3,4]))  # Euclidean distance


-1.0
2.8284271247461903
# The erf() function can be used to compute traditional
# statistical functions such as the CDF of
# the standard Gaussian distribution
def phi(x):
    # CDF of the standard Gaussian distribution
    return (1.0 + math.erf(x/math.sqrt(2.0)))/2.0


phi(1)


0.8413447460685428


(ii) random: This module yields random number generation. See below for some 
examples. 


import random


random.randrange(start=1, stop=10, step=1)
# a random number in range(start, stop, step)
random.randrange(10)  # integer from 0 to 9 inclusive


5


# returns random integer n such that a<=n<=b
random.randint(1, 10)


7


(iii) itertools: This package provides a succinct way of searching for all the
possible cases in many combinatorics-related scenarios. It is particularly useful
for solving Boolean problems (that we learned in Section 1.5) by brute force.


from itertools import permutations, combinations


# generating all permutations of [1, 2, 3]
p = permutations([1, 2, 3])


for i in p:
    print(i)


(1, 2, 3)
(1, 3, 2)
(2, 1, 3)
(2, 3, 1)
(3, 1, 2)
(3, 2, 1)
# generating all length-2 combinations of [1, 2, 3]
c = combinations([1, 2, 3], 2)


for i in c:
    print(i)


(1, 2)
(1, 3)
(2, 3)


# generating all length-3 combinations of [1, 2, 3, 4, 5]
c = combinations([1, 2, 3, 4, 5], 3)


for i in c:
    print(i)


(1, 2, 3)
(1, 2, 4)
(1, 2, 5)
(1, 3, 4)
(1, 3, 5)
(1, 4, 5)
(2, 3, 4)
(2, 3, 5)
(2, 4, 5)
(3, 4, 5)
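To illustrate the brute-force use mentioned above, here is a hedged sketch that enumerates every 0/1 assignment with `itertools.product` (a sibling of the two functions shown, from the same package). The knapsack-type instance and the function name `brute_force_boolean` are hypothetical examples, not a problem taken from Section 1.5.

```python
from itertools import product

def brute_force_boolean(values, weights, capacity):
    """Maximize sum(values * x) subject to sum(weights * x) <= capacity, x in {0,1}^n."""
    best_x, best_value = None, float("-inf")
    for x in product([0, 1], repeat=len(values)):   # all 2^n Boolean vectors
        weight = sum(w * xi for w, xi in zip(weights, x))
        if weight <= capacity:                      # keep feasible points only
            value = sum(v * xi for v, xi in zip(values, x))
            if value > best_value:
                best_x, best_value = x, value
    return best_x, best_value

# toy instance (assumed): three items, capacity 5
best_x, best_value = brute_force_boolean([6, 10, 12], [1, 2, 3], 5)
print(best_x, best_value)
```

The search visits 2^n candidates, so it is practical only for small n; that exponential cost is what motivates relaxation techniques for larger Boolean problems.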


(iv) numpy: Numpy is the most popular package for handling matrices and vec- 
tors. It offers many useful functions. We list a couple of the prominent functions 
frequently employed. 


(a) numpy.array(): numpy.array() is a specialized array data structure in 
numpy. This differs from Python data type array(). 


import numpy as np
np.array([1, 2, 3])  # construct an array
array([1, 2, 3])
np.array([[1, 2], [3, 4]])  # construct a 2D array


array([[1, 2],
       [3, 4]])
x = np.ones((2,2))
# construct an all-one matrix with size of 2-by-2
x = np.zeros((2,2))
# construct an all-zero matrix with size of 2-by-2
print(np.ones_like(x))
# all-one matrix with the same shape and type of input
print(np.zeros_like(x))
# all-zero matrix with the same shape and type of input


[[1. 1.]
 [1. 1.]]
[[0. 0.]
 [0. 0.]]


(b) numpy.random This module is designed to perform random sampling
from various probability distributions. Below we list a few popular examples. To
learn more, you may want to consult:


https://numpy.org/doc/1.16/reference/routines.random.html


# sampling a number from standard Gaussian distribution
np.random.normal(loc = 0, scale = 1)
# loc: mean, scale: standard deviation
np.random.randn()  # plays the same role


-2.5459976698222495


# sampling multiple numbers as per the standard Gaussian
np.random.normal(0, 1, size = (2, 2))
# Here the size determines the output shape
np.random.randn(2,2)  # plays the same role


array([[-1.8133258 , -1.01151295],
       [-0.37375747,  0.36005748]])


np.random.rand(2,2)  # Uniform over [0,1)


array([[0.06535694, 0.2507505 ],
       [0.17559137, 0.60967901]])


(c) numpy.linalg This package offers many useful linear-algebra related
functions. Here are a few of them.
from numpy import linalg


x = np.random.randn(2,2)
print(linalg.det(x))  # Determinant of a matrix x
print(linalg.inv(x))  # Inverse of a matrix x
print(linalg.norm(x))  # Matrix or vector norm
print(linalg.svd(x))  # Singular value decomposition
print(linalg.eig(x))  # Eigenvalue decomposition


0.7125655927348966
[[ 0.77007826 -0.38835738]
 [ 2.33455331  0.64504946]]
1.832010151997132
(array([[-0.2060815 ,  0.97853483],
       [ 0.97853483,  0.2060815 ]]), array([1.78814528, 0.39849424]),
 array([[-0.96330981,  0.2683919 ],
       [ 0.2683919 ,  0.96330981]]))
(array([0.50418566+0.67702467j, 0.50418566-0.67702467j]),
 array([[0.02479485-0.37684352j, 0.02479485+0.37684352j],
       [0.92594502+0.j        , 0.92594502-0.j        ]]))
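The outputs of these routines are tied together by a few standard identities, which the following sketch checks numerically. The seeded generator and the variable names are assumptions; any invertible matrix would serve equally well.

```python
import numpy as np
from numpy import linalg

rng = np.random.default_rng(0)        # assumed seed, for reproducibility
x = rng.standard_normal((2, 2))

x_inv = linalg.inv(x)
U, s, Vt = linalg.svd(x)              # x = U @ diag(s) @ Vt
eigvals, eigvecs = linalg.eig(x)

# inverse: x @ x_inv is the identity
print(np.allclose(x @ x_inv, np.eye(2)))
# SVD reconstructs the matrix
print(np.allclose(U @ np.diag(s) @ Vt, x))
# default (Frobenius) norm equals the root-sum-square of singular values
print(np.allclose(linalg.norm(x), np.sqrt(np.sum(s ** 2))))
# determinant equals the product of the eigenvalues
print(np.allclose(np.prod(eigvals), linalg.det(x)))
```

Note that `linalg.svd` returns the third factor already transposed, so no extra transpose is needed in the reconstruction.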


(d) resizing: Resizing is often used for transforming the shape of an array
into another.


x = np.random.randn(4,4,1)
y = x.view(dtype=np.float64).reshape(-1,2)
# '-1' is inferred from the context: Shape of (8,2)
print(y)
z = x.squeeze()
print(z.shape)


[[-0.85719316  2.99692221]
 [ 1.16327996 -0.11955541]
 [-0.76229609  0.79871494]
 [ 0.99757568  0.69329723]
 [-1.52198295 -0.74430996]
 [ 0.17174063  0.25343301]
 [ 0.07151011 -2.90945412]
 [ 1.1874155  -0.64209109]]
(4, 4)

(v) scipy.stats: This module provides a large collection of probability distribu- 
tions and related statistics. Below we present only a few of them. For more infor- 
mation, you can refer to: 


https://docs.scipy.org/doc/scipy/reference/stats.html 


from scipy import stats 
# A random variable with the standard Gaussian
X = stats.norm(loc = 0, scale = 1)
# loc: mean, scale: standard deviation
print(X.cdf(np.array([-1, 0, 1])))
# computes the CDF at each element of the numpy array
print(X.rvs(size = 3))
# generates three random samples


[0.15865525 0.5 0.84134475] 
[ 0.39460402 -0.8042592 -0.71404882] 


# Another random variable with the uniform distribution 
Y = stats.uniform(loc = 0, scale = 1)
# uniform distribution in [loc, loc + scale]
print(Y.cdf(np.array([-1, 0, 0.5, 1])))
print(Y.rvs(size = 3))


[0.  0.  0.5 1. ]
[0.72953474 0.67879248 0.47947748] 
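The CDF values printed above can be reproduced by hand. The sketch below uses only the standard library: the standard Gaussian CDF is written via the error function, and the uniform CDF is a clipped linear ramp. The helper names standard_normal_cdf and uniform_cdf are hypothetical, introduced only for this check.

```python
import math

def standard_normal_cdf(z):
    """CDF of the standard Gaussian, expressed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def uniform_cdf(t, loc=0.0, scale=1.0):
    """CDF of the uniform distribution on [loc, loc + scale]."""
    return min(1.0, max(0.0, (t - loc) / scale))

normal_values = [standard_normal_cdf(z) for z in (-1, 0, 1)]
uniform_values = [uniform_cdf(t) for t in (-1, 0, 0.5, 1)]

print(normal_values)
print(uniform_values)
```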


A.2.3 Visualization 


The most popular module for drawing a graph is matplotlib.pyplot. Here is how
to use it.
import matplotlib.pyplot as plt 


x_value = [x for x in range(10)] 
y_value = [y for y in range(10, 20)] 
plt.plot(x_value, y_value) # create a plot 


plt.xlabel('x') # labeling x-axis
plt.ylabel('y') # labeling y-axis
plt.show()


# show() must be called with parentheses. It is optional in a jupyter notebook.




Figure A.7. Plotting a simple function via matplotlib.pyplot. 
# Drawing multiple graphs in a single plot


x = [x for x in range(10)]
y_1 = [3*y for y in range(10)]
y_2 = [2*y for y in range(10)]


plt.plot(x, y_1) # plot_1
plt.plot(x, y_2) # plot_2
plt.xlabel('x') # labeling x-axis
plt.ylabel('y') # labeling y-axis
plt.legend(['y=3x', 'y=2x'])


<matplotlib.legend.Legend at 0x1d9e45c7a00>




Figure A.8. Multiple functions and legend. 
CVXPY Basics


Outline There are a couple of high-level software packages specialized for
solving convex optimization problems. One convenient and user-friendly package
is CVX, developed by Michael Grant and Stephen P. Boyd. While it is intuitive
and therefore easy to learn, it is built upon MATLAB, a proprietary platform
that requires a license. In contrast, two other packages run on the open-source
platform Python: CVXPY and scipy.optimize. Of the two, this book focuses on
CVXPY, which we believe is friendlier and which keeps evolving through the
contributions of many researchers and engineers. In this section, we cover the
basics of CVXPY. First, we present how to install the CVXPY library in Python.
Writing a CVXPY script consists of three key procedures: (i) defining an
optimization problem; (ii) calling a solver for the problem; and (iii)
obtaining the solution. So in the second part, we show how to define an
optimization problem in terms of concepts that we have learned thus far, like
variables, constraints and objective. We then study how to solve the problem.
For ease of illustration, we demonstrate the whole procedure via a simple
example.


Installation The use of CVXPY requires the installation of Python 3. For
Python 3, we can rely upon environment tools like Anaconda, which we described
in Section A.1.
Installing the CVXPY library is very simple. We can do it with the package
manager pip.
pip install cvxpy 


Collecting cvxpy
  Downloading cvxpy-1.2.0-cp39-cp39-win_amd64.whl (832 kB)
Collecting ecos>=2
  Downloading ecos-2.0.10-cp39-cp39-win_amd64.whl (68 kB)
Collecting scs>=1.1.6
  Downloading scs-3.2.0-cp39-cp39-win_amd64.whl (8.1 MB)
Requirement already satisfied: numpy>=1.15 in
c:\programdata\anaconda3\lib\site-packages (from cvxpy) (1.20.3)
Collecting osqp>=0.4.1
  Downloading osqp-0.6.2.post5-cp39-cp39-win_amd64.whl (278 kB)
Requirement already satisfied: scipy>=1.1.0 in
c:\programdata\anaconda3\lib\site-packages (from cvxpy) (1.7.1)
Collecting qdldl
  Downloading qdldl-0.1.5.post2-cp39-cp39-win_amd64.whl (83 kB)
Installing collected packages: qdldl, scs, osqp, ecos, cvxpy
Successfully installed cvxpy-1.2.0 ecos-2.0.10 osqp-0.6.2.post5
qdldl-0.1.5.post2 scs-3.2.0
Note: you may need to restart the kernel to use updated packages.


The command pip list allows us to check whether CVXPY is successfully
installed. Note the cvxpy entry in the middle of the list below.


pip list 

Package                  Version
alabaster                0.7.12
anaconda-client          1.9.0
anaconda-navigator       2.1.1
anaconda-project         0.10.1
cvxpy                    1.2.0
jupyter                  1.0.0
jupyter-client           6.1.12
jupyter-console          6.4.0
jupyter-core             4.8.1
jupyter-server           1.4.1
jupyterlab               3.2.1
jupyterlab-pygments      0.1.2
jupyterlab-server        2.8.2
jupyterlab-widgets       1.0.0
zict                     2.0.0
zipp                     3.6.0
zope.event               4.5.0
zope.interface           5.4.0


Alternatively, one can attempt to import the CVXPY library.


import cvxpy 


If no error occurs, you are ready to use CVXPY. For any errors that you may
encounter and that we do not mention here, you may want to consult the
installation guide at:


https://www.cvxpy.org/install/index.html


How to define an optimization problem In CVXPY, an optimization problem
comprises four components: (i) variables; (ii) constraints; (iii) objective;
and (iv) parameters. Let us look at how to define each of them.

1. Variables: The variables refer to the optimization variables that we have
played with throughout. We define a scalar variable via cp.Variable(). We
construct a vector by putting its size inside the parentheses, and a matrix by
passing a tuple of two numbers. In the CVXPY version used here (1.2.0), a
variable cannot go beyond a 2D matrix, so it cannot be a 3D or 4D tensor. Here
are some examples.

import cvxpy as cp 

# Construct a scalar opt. var. with empty parentheses
x = cp.Variable()
# Can construct another in case we have more variables.
y = cp.Variable()

# Or we can construct a vector of size 3.
z = cp.Variable(3)
# Or a matrix with a proper size, say (2,3)
w = cp.Variable((2,3))


2. Constraints: There are two types of constraints: (i) equality; and (ii)
inequality constraints. We need not follow the standard form that we learned in
Section 1.2, which allows for more flexible descriptions. We use the ==, <= and
>= symbols to implement them. Here is an example:

constraints = [x+y == 10, x-y <= 3]


3. Objective: The objective refers to the objective function that we wish to mini- 
mize or maximize. Depending on the optimization type between min and max, we 
use either of the following two: 


obj_min = cp.Minimize((x-3*y)**2)
obj_max = cp.Maximize((x-3*y)**2) # syntax only: maximizing a convex function is not a convex problem


4. Parameters: Parameters are constants whose values can be changed without
rebuilding the problem. They are typically employed for perturbing an
optimization problem. For instance, we introduce a parameter, say a, that
multiplies y in the above minimization, e.g., cp.Minimize((x-a*y)**2). The way
to introduce the parameter is very simple:


a = cp.Parameter(nonneg=True)
a.value = 3 # assigning a value 3 for 'a'


Here nonneg=True declares that the parameter is non-negative. If it is
non-positive, we can write nonpos=True instead. A good thing about the use of
parameters is that changing an optimization problem does not require
re-defining it from scratch.

Once the variables, constraints, objective and parameters are defined as above, 
the last step is to formulate a problem object via cp.Problem(). 


prob = cp.Problem(obj_min, constraints) 


How to solve an optimization problem We can solve a formulated problem by
calling a solver. The implementation is very simple, requiring a single line of
code:


prob.solve() # Solve 'prob' and return the optimal value
16.0


The attribute prob.status allows us to check whether or not the derived
solution is optimal. The optimal value and the corresponding variables can also
be retrieved with prob.value, x.value and y.value.

print('status: ', prob.status)
print('optimal value: ', prob.value)
print('optimal variables: ', x.value, y.value)


status: optimal 
optimal value: 16.0 
optimal variables: 6.5 3.5 
Here is a script for the entire code that also incorporates the use of the
parameter a in the objective cp.Minimize((x-a*y)**2).


import cvxpy as cp

x = cp.Variable()
y = cp.Variable()

constraints = [x+y == 10, x-y <= 3]

a = cp.Parameter(nonneg=True)
a.value = 3

obj_min = cp.Minimize((x-a*y)**2)

prob = cp.Problem(obj_min, constraints)
prob.solve()

print('status: ', prob.status)
print('optimal value: ', prob.value)
print('optimal variables: ', x.value, y.value)


status: optimal 
optimal value: 16.0 
optimal variables: 6.5 3.5 


Notice that we have exactly the same solution as before. Setting a.value=5
instead leads to a different optimal value, shown below, while the optimal
variables happen to stay the same. Please check.


status: optimal 
optimal value: 121.0 
optimal variables: 6.5 3.5 
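These numbers can be verified by hand. Substituting x = 10 - y turns the inequality x - y <= 3 into y >= 3.5 and the objective into (10 - (1 + a) y)^2, a parabola in y whose vertex sits at y = 10 / (1 + a). The sketch below carries out this substitution in plain Python. The function name solve_by_substitution is hypothetical, and the case a = 1 is added only for contrast.

```python
def solve_by_substitution(a):
    """Minimize (x - a*y)**2 subject to x + y == 10 and x - y <= 3, for a >= 0."""
    y_lower = 3.5                      # x - y <= 3 with x = 10 - y gives y >= 3.5
    y_vertex = 10.0 / (1.0 + a)        # unconstrained minimizer of (10 - (1 + a)*y)**2
    y = max(y_vertex, y_lower)         # project the vertex onto the feasible half-line
    x = 10.0 - y
    return (x - a * y) ** 2, x, y

for a in (1, 3, 5):
    print(a, solve_by_substitution(a))
```

For a = 3 and a = 5 the vertex falls below 3.5, so the constraint is active in both cases and the optimal variables coincide, while the optimal value changes from 16 to 121.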


DOI: 10.1561/9781638280538.ch6
Outline One of the machine learning applications that we studied in Part III is
deep learning. Deep learning is a learning methodology that takes a deep neural
network (DNN) as the basic model for the prediction module of interest. There
are many prevalent software tools that ease deep learning implementation. Such
tools are called machine learning frameworks or application programming
interfaces (APIs). Examples include TensorFlow, Keras, PyTorch, DL4J, Caffe and
MXNet. The frameworks have pros and cons, depending on what we pursue in the
design of a deep learning model: good usability, fast training, functionality,
or high scalability in distributed training. This book aims at usability, so we
are going to study the most high-level API, with a focus on enabling fast user
experimentation: Keras. The Keras API allows us to go from idea to
implementation in very few steps. In this appendix, we present the basics of
Keras. Keras is fully integrated with TensorFlow, meaning that it comes
packaged with the TensorFlow installation. So in the first part, we will learn
how to install TensorFlow. Deep learning implementation requires three key
procedures: (i) preparing and processing data; (ii) building a neural network
model; and (iii) training a model and testing the trained model. So in the
second part, we will touch upon an easy way to deal with data, offered in
Keras. We will then study how to build a neural network model based on some
popular packages including keras.models and keras.layers. Lastly, we will
investigate how to train and test a model. For ease of illustration, we will
demonstrate the entire procedure via a simple example.
Figure C.1. Handwritten digit classification.


Installation The use of Keras requires the installation of the TensorFlow package. 
The installation procedure is very simple: 


pip install tensorflow 


The TensorFlow 2 packages, which fully support Keras and which we will use,
require a pip version higher than 19.0 (or higher than 20.3 for macOS). So you
may need to install the latest pip via: pip install --upgrade pip. To check
whether TensorFlow is successfully installed, one can attempt to import keras
as follows.


from tensorflow import keras 


If no error occurs, you are ready to use Keras. For any errors that you may
encounter, you may want to consult the installation guide at:


https://www.tensorflow.org/install 


A simple task of focus To illustrate how to use Keras, we focus on a simple
task: handwritten digit classification, wherein the goal is to figure out a
digit from a handwritten image. See Fig. C.1 for a sample image. The figure
illustrates an instance in which an image of digit 2 is correctly recognized.


Preparing and processing data There is a popular dataset associated with the
digit classification task: the MNIST (Modified National Institute of Standards
and Technology) dataset. It contains $m = 60{,}000$ training images and
$m_{\text{test}} = 10{,}000$ testing images. Each image, say $x^{(i)}$,
consists of $28 \times 28$ pixels, each indicating a gray-scale level ranging
from 0 (white) to 1 (black). It also comes with a corresponding label, say
$y^{(i)}$, that takes one of the 10 classes: $y^{(i)} \in \{0, 1, \dots, 9\}$.
See Fig. C.2.

Figure C.2. MNIST dataset: An input image is of 28-by-28 pixels, each
indicating an intensity from 0 (white) to 1 (black); each label with size 1
takes one of the 10 classes from 0 to 9.


One great benefit of Keras is that popular datasets such as MNIST are stored in
a sub-library: keras.datasets. Moreover, the train and test datasets are
already there with a proper split ratio, so we don't need to worry about how to
split them. The only script that we should write for importing MNIST is:

from tensorflow.keras.datasets import mnist 

(X_train, y_train), (X_test, y_test) = mnist.load_data()

X_train = X_train/255. 

X_test = X_test/255. 


Downloading data from
https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz

11493376/11490434 [==============================] - 1s 0us/step
11501568/11490434 [==============================] - 1s 0us/step


Here we divide the input (X_train or X_test) by its maximum value 255 for the
purpose of normalization. This procedure is often done as a part of data
preprocessing. In case keras.datasets does not offer a dataset of interest, we
need data preprocessing techniques that rely on another prominent library,
named pandas. pandas is particularly instrumental in handling .csv files. Here
we will not explain how to use pandas in detail. If you want to learn more, you
may want to consult:


https://pandas.pydata.org/ 
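The normalization step described above can be illustrated without downloading MNIST. The sketch below uses randomly generated 8-bit images as a stand-in (an assumption made only so that the example is self-contained) and shows that dividing by 255 converts the integer pixels into floats in the range [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a batch of MNIST images: 8-bit integers in [0, 255]
fake_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

scaled = fake_images / 255.0   # true division promotes uint8 to float64

print(scaled.dtype, scaled.min(), scaled.max())
```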


For data visualization, we employ matplotlib.pyplot. Below is a short snippet
for plotting one sample image.


import matplotlib.pyplot as plt 


plt.imshow(X_train[0], cmap = 'gray_r')
plt.colorbar()
plt.title('{}'.format(y_train[0]), fontsize=30)


Text(0.5, 1.0, '5')


Figure C.3. A sample image in MNIST dataset.


See Fig. C.3 for the output. Here the option cmap = 'gray_r' gives a white
background and a black digit. We use cmap = 'gray' for the flipped one, i.e., a
white digit on a black background. The colorbar() function displays the color
bar on the right as in Fig. C.3.
We can also plot many images in one figure. Here is an example of displaying
60 images.
num_of_images = 60
for index in range(1, num_of_images+1):
    plt.subplot(6, 10, index)
    plt.axis('off')
    plt.imshow(X_train[index], cmap = 'gray_r')


See Fig. C.4 for the output. 


Building a neural network model Let us employ a two-layer neural network 
that we studied in Sections 3.3 and 3.4. Specifically we introduce a hidden layer 
with 500 neurons. See Fig. C.5 for illustration of the network architecture. As we 
learned in Section 3.4, we employ the ReLU activation in the hidden layer, and 
softmax activation in the output layer. 

Keras includes two major packages: 


(i) tensorflow.keras.models; 


(ii) tensorflow.keras.layers.
Figure C.4. Plotting many image samples in a single figure.
Figure C.5. A two-layer fully-connected neural network where input size is 28 x 28 =
784, the number of hidden neurons is 500 and the number of classes is 10. We employ
ReLU activation for the hidden layer, and softmax activation for the output layer.


The models package contains several functionalities regarding a neural network
itself. One major class is Sequential, which represents a neural network as a
linear stack of layers. The layers package includes many elements that
constitute a neural network. Examples include fully-connected dense layers and
activation functions. These two allow us to readily construct the model
illustrated in Fig. C.5.


from tensorflow.keras.models import Sequential 
from tensorflow.keras.layers import Dense, Flatten 


model = Sequential()
model.add(Flatten(input_shape=(28,28)))
model.add(Dense(500, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.summary() 




Model: "sequential_1"

Layer (type)          Output Shape      Param #
dense (Dense)         (None, 500)       392500
dense_1 (Dense)       (None, 10)        5010

Total params: 397,510
Trainable params: 397,510
Non-trainable params: 0


Here Flatten is a layer that turns a higher-dimensional input, like a 2D
matrix, into a vector. In this example, a digit image of size 28-by-28 is
flattened into a vector of size 784 (= 28 x 28). add is a method for attaching
a layer of interest to the end of the sequential model. Dense refers to a
fully-connected layer. Its input size is automatically determined by the layer
that it is attached to in the sequential model. So the only thing to specify is
the number of output neurons. In this example, 500 refers to the number of
hidden neurons. We can also set an activation function with another argument,
like activation='relu'. The output layer comes with 10 neurons (coinciding with
the number of classes) and softmax activation (representing the probability of
an output belonging to a certain class). The summary() method lists all the
layers, specifying the output size and the number of parameters of each.
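To see what the three layers compute and where the parameter counts in the summary come from, here is a rough NumPy sketch of the same architecture. It is not how Keras implements the model internally. The weights are random stand-ins for trained ones, and the names W1, b1, W2, b2 and forward are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 28 * 28, 500, 10

# Randomly initialized weights stand in for trained ones
W1, b1 = 0.01 * rng.standard_normal((n_in, n_hidden)), np.zeros(n_hidden)
W2, b2 = 0.01 * rng.standard_normal((n_hidden, n_out)), np.zeros(n_out)

def forward(images):
    """Flatten -> Dense(500, relu) -> Dense(10, softmax)."""
    flat = images.reshape(len(images), -1)                  # (batch, 784)
    hidden = np.maximum(flat @ W1 + b1, 0.0)                # ReLU activation
    logits = hidden @ W2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)     # for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)             # softmax: rows sum to 1

probs = forward(rng.random((3, 28, 28)))
n_params = W1.size + b1.size + W2.size + b2.size

print(probs.shape, probs.sum(axis=1), n_params)
```

Each dense layer has (inputs x outputs) weights plus one bias per output, which gives 784 x 500 + 500 = 392,500 and 500 x 10 + 10 = 5,010, matching the summary.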


Training a model For training, we first need to set up an algorithm
(optimizer) to be employed. One popular algorithm that we have often played
with is gradient descent. But here we will utilize its advanced version that we
also learned in Section 3.5: the Adam optimizer. The Adam optimizer can be
viewed as a smart tweak of gradient descent that enables more stable training.
As mentioned earlier, Adam has three key hyperparameters: (i) the learning rate
$\alpha$; (ii) $\beta_1$ (capturing the weight of past gradients); and (iii)
$\beta_2$ (indicating the weight of the square of past gradients). The default
choice reads: $(\alpha, \beta_1, \beta_2) = (0.001, 0.9, 0.999)$. These values
are used if nothing is specified.
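For reference, a single Adam update can be sketched in a few lines of NumPy. This follows the commonly stated form of the algorithm with bias correction, and its notation may differ in detail from Section 3.5. The function name adam_step, the stabilizing constant eps = 1e-7 and the toy objective f(w) = (w - 3)^2 are assumptions made for illustration.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One Adam update; t is the 1-based step index."""
    m = beta_1 * m + (1 - beta_1) * grad        # running average of past gradients
    v = beta_2 * v + (1 - beta_2) * grad**2     # running average of squared gradients
    m_hat = m / (1 - beta_1**t)                 # bias correction for the zero initialization
    v_hat = v / (1 - beta_2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem (an assumption): minimize f(w) = (w - 3)**2 starting from w = 0
w, m, v = 0.0, 0.0, 0.0
grad = 2 * (w - 3.0)                            # gradient of f at w = 0 is -6
w_after_one_step, m, v = adam_step(w, grad, m, v, t=1)

print(w_after_one_step)
```

After bias correction, the first update has magnitude very close to the learning rate regardless of the gradient scale, which is one reason Adam tends to be less sensitive to the scaling of the loss.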

We also need to specify a loss function. As we learned in Section 3.2 (also via
Prob 8.2 for the more-than-two class case), the optimal choice in the sense of
maximizing likelihood is cross entropy. A performance metric that we will look
at during training and testing can also be specified. One metric frequently
employed is accuracy. One can set all of these via another method, compile.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])


Here the option optimizer='adam' sets the default choice of the learning rate
and betas. For a manual choice, we first define:


import tensorflow
opt = tensorflow.keras.optimizers.Adam(
    learning_rate=0.01,
    beta_1=0.92,
    beta_2=0.992)


We then replace the above option with optimizer=opt. The option
loss='sparse_categorical_crossentropy' means the use of the cross entropy loss
with integer class labels (0 to 9 here) rather than one-hot vectors.
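The cross entropy loss with integer labels reduces to the average negative log-probability assigned to the true class. The NumPy sketch below computes it both from integer labels and from the equivalent one-hot labels. The function shares its name with the Keras option only for readability, and the probabilities and labels are made-up toy values.

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probs):
    """Mean of -log(probability assigned to the true class)."""
    labels = np.asarray(labels)
    probs = np.asarray(probs)
    picked = probs[np.arange(len(labels)), labels]   # probability of the true class per example
    return -np.mean(np.log(picked))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])             # integer labels, not one-hot
one_hot = np.eye(3)[labels]           # the same labels in one-hot form

loss_sparse = sparse_categorical_crossentropy(labels, probs)
loss_one_hot = -np.mean(np.sum(one_hot * np.log(probs), axis=1))

print(loss_sparse, loss_one_hot)
```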

Now we can train the model on MNIST data. During training, we often employ only
a part of the entire examples to compute a gradient of a loss function. The
part is called a batch. Two more terms are worth knowing. One is the step,
which refers to a loss computation (and the accompanying parameter update) over
the examples in a single batch. The other is the epoch, which refers to one
full pass over all the examples. In our experiment, let us use a batch size of
64 and 20 epochs.
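The relation between batch, step and epoch can be checked with simple arithmetic. With 60,000 training examples and a batch size of 64, one epoch takes the number of steps computed below, which is the count that appears at the start of each line of the training log. The variable names are illustrative.

```python
import math

num_examples = 60000
batch_size = 64

steps_per_epoch = math.ceil(num_examples / batch_size)               # 937.5 rounded up
last_batch_size = num_examples - (steps_per_epoch - 1) * batch_size  # examples left for the final step

print(steps_per_epoch, last_batch_size)
```

The final batch of each epoch is smaller than the others because 60,000 is not a multiple of 64.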


history = model.fit(X_train, y_train, batch_size=64, epochs=20) 


Epoch 1/20
938/938 [==============================] - 2s 2ms/step - loss: 0.0025 - acc: 0.9992
Epoch 2/20
938/938 [==============================] - 2s 2ms/step - loss: 0.0059 - acc: 0.9981
Epoch 3/20
938/938 [==============================] - 2s 2ms/step - loss: 0.0031 - acc: 0.9990
Epoch 4/20
938/938 [==============================] - 2s 2ms/step - loss: 0.0074 - acc: 0.9976
Epoch 5/20
938/938 [==============================] - 2s 2ms/step - loss: 0.0025 - acc: 0.9993
Epoch 6/20
938/938 [==============================] - 2s 2ms/step - loss: 0.0043 - acc: 0.9984
Epoch 7/20
938/938 [==============================] - 2s 2ms/step - loss: 0.0044 - acc: 0.9984
Epoch 8/20
938/938 [==============================] - 2s 2ms/step - loss: 0.0010 - acc: 0.9998
Epoch 9/20
938/938 [==============================] - 2s 2ms/step - loss: 1.2813e-04 - acc: 1.0000
Epoch 10/20
938/938 [==============================] - 2s 2ms/step - loss: 3.5169e-05 - acc: 1.0000
Epoch 11/20
938/938 [==============================] - 2s 2ms/step - loss: 2.1899e-05 - acc: 1.0000
Epoch 12/20
938/938 [==============================] - 2s 2ms/step - loss: 1.6756e-05 - acc: 1.0000
Epoch 13/20
938/938 [==============================] - 2s 2ms/step - loss: 1.2778e-05 - acc: 1.0000
Epoch 14/20
938/938 [==============================] - 2s 2ms/step - loss: 9.8947e-06 - acc: 1.0000
Epoch 15/20
938/938 [==============================] - 2s 2ms/step - loss: 0.0082 - acc: 0.9981
Epoch 16/20
938/938 [==============================] - 2s 2ms/step - loss: 0.0090 - acc: 0.9971
Epoch 17/20
938/938 [==============================] - 2s 2ms/step - loss: 0.0016 - acc: 0.9995
Epoch 18/20
938/938 [==============================] - 2s 2ms/step - loss: 3.9583e-04 - acc: 0.9999
Epoch 19/20
938/938 [==============================] - 2s 2ms/step - loss: 7.6672e-05 - acc: 1.0000
Epoch 20/20
938/938 [==============================] - 2s 2ms/step - loss: 2.4958e-05 - acc: 1.0000


One good thing about the fit() function is that it returns a History object
whose history attribute is a dictionary of the metrics collected during
training. We can check the collected metrics via:


# list all data in history object 
print(history.history.keys())


dict_keys(['loss', 'acc'])


Using this data, we can also plot an accuracy curve as a function of epochs. 


plt.plot(history.history['acc'])
plt.title('model accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')


Text(0, 0.5, 'accuracy')


Testing the trained model For testing, we first need to make a prediction from
the model output. To this end, we use the predict() function as follows:


model.predict(X_test).argmax(1) 
array([7, 2, 1, ..., 4, 5, 6], dtype=int64)


Here argmax(1) returns, for each test image, the class with the highest softmax
output among the 10 classes. In order to evaluate the test accuracy, we use the
evaluate() function:


model.evaluate(X_test, y_test) 
Figure C.6. Accuracy as a function of epochs.


313/313 [==============================] - 0s 751us/step - loss: 0.1001 - acc: 0.9847


[0.10007859766483307, 0.9847000241279602] 
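The accuracy reported by evaluate() is the fraction of examples whose argmax prediction matches the label. The sketch below reproduces that computation in NumPy on a tiny made-up batch with three classes. The probabilities and labels are toy values, not MNIST outputs.

```python
import numpy as np

# Made-up softmax outputs for four examples and three classes
probs = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.2, 0.5, 0.3]])
labels = np.array([1, 0, 1, 1])

predictions = probs.argmax(axis=1)           # index of the largest probability in each row
accuracy = np.mean(predictions == labels)    # fraction of correct predictions

print(predictions, accuracy)
```

Here the third example is misclassified (predicted class 2, true class 1), so three of the four predictions are correct.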


Saving and loading Saving the trained model and loading the saved model later 
is very simple. See below. 


model.save('saved_classifier')


INFO:tensorflow:Assets written to: saved_classifier\assets 


import tensorflow
loaded_model = tensorflow.keras.models.load_model(
    'saved_classifier')